Abstract

Most existing binary classification methods target the optimization of the overall
classification risk and may fail to serve some real-world applications such as cancer diagnosis,
where users are more concerned with the risk of misclassifying one specific class than the
other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel
statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers
with a minimal type II error and a constrained type I error under a user-specified level.
This article is the first attempt to construct classifiers with guaranteed theoretical
performance under the NP paradigm in high-dimensional settings. Based on the fundamental
Neyman-Pearson Lemma, we use a plug-in approach to construct NP-type classifiers for
Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are
natural NP paradigm counterparts of the oracle inequalities in classical binary
classification. Besides their desirable theoretical properties, we also demonstrate their numerical
advantages in prioritized error control via both simulation and real data studies.
Keywords: classification, high-dimension, Naive Bayes, Neyman-Pearson (NP) paradigm,
NP oracle inequality, plug-in approach, screening


1. Introduction 


Classification plays an important role in many aspects of our society. In medical research, 
identifying pathogenically distinct tumor types is central to advances in cancer treatments 
(Golub et al., 1999; Alderton, 2014). In cyber security, spam messages and viruses make
automatic categorical decisions a necessity. Binary classification is arguably the simplest 
and most important form of classification problems, and can serve as a building block for 
more complicated applications. We focus our attention on binary classification in this work.

A few common notations are introduced to facilitate our discussion. Let $(X, Y)$ be a random
pair where $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of features and $Y \in \{0,1\}$ indicates $X$'s class label. A
classifier $\phi : \mathcal{X} \to \{0,1\}$ is a mapping from $\mathcal{X}$ to $\{0,1\}$ that assigns $X$ to one of the classes. A
classification loss function is defined to assign a “cost” to each misclassified instance
$\phi(X) \neq Y$, and the classification error is defined as the expectation of this loss function with respect
to the joint distribution of $(X, Y)$. We will focus our discussion on the 0-1 loss function
$\mathbb{I}\{\phi(X) \neq Y\}$ throughout the paper, where $\mathbb{I}(\cdot)$ denotes the indicator function. Denote by
$\mathbb{P}$ and $\mathbb{E}$ the generic probability distribution and expectation, whose meaning depends on
specific contexts. The classification error is $R(\phi) = \mathbb{E}\,\mathbb{I}\{\phi(X) \neq Y\} = \mathbb{P}\{\phi(X) \neq Y\}$. The
law of total probability allows us to decompose it into a weighted average of type I error
$R_0(\phi) = \mathbb{P}\{\phi(X) \neq Y \mid Y = 0\}$ and type II error $R_1(\phi) = \mathbb{P}\{\phi(X) \neq Y \mid Y = 1\}$ as

$$R(\phi) = \mathbb{P}(Y = 0) R_0(\phi) + \mathbb{P}(Y = 1) R_1(\phi). \qquad (1.1)$$
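The decomposition (1.1) can be checked numerically on a toy example. The sketch below is illustrative only: the function name `error_rates` and the toy labels are assumptions introduced here, and the class probabilities are replaced by their empirical proportions.

```python
def error_rates(y_true, y_pred):
    """Return (type I error, type II error, overall risk) for 0/1 labels."""
    n0 = sum(1 for y in y_true if y == 0)
    n1 = sum(1 for y in y_true if y == 1)
    # Type I error: class 0 instances predicted as 1.
    r0 = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1) / n0
    # Type II error: class 1 instances predicted as 0.
    r1 = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0) / n1
    n = n0 + n1
    # Empirical analogue of (1.1): class proportions weight the two errors.
    risk = (n0 / n) * r0 + (n1 / n) * r1
    return r0, r1, risk


# Hypothetical labels: 4 instances of class 0 and 6 of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
r0, r1, risk = error_rates(y_true, y_pred)
```

In this toy example the overall risk coincides with the plain misclassification rate, while the type I error is considerably larger than the overall risk, which is the asymmetry the NP paradigm targets.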


With the advent of high-throughput technologies, classification tasks have experienced
an exponential growth in the feature dimensions throughout the past decade. The
fundamental challenge of “high dimension, low sample size” has motivated the development of
a plethora of classification algorithms for various applications. While dependencies among
features are usually considered a crucial characteristic of the data (Ackermann and
Strimmer, 2009), and can effectively reduce classification errors under suitable models and relative
data abundance (Shao et al., 2011; Cai and Liu, 2011; Fan et al., 2012; Mai et al., 2012;
Witten and Tibshirani, 2012), independence rules, with their superb scalability, become a
rule of thumb when the feature dimension grows faster than the sample size (Hastie et al.,
2009; James et al., 2013). Despite Naive Bayes models’ reputation of being “simplistic”
by ignoring all dependency structure among features, they lead to simple classifiers that
have proven worthy on high-dimensional data with remarkably good performances in
numerous real-life applications. Taking the classical model setting of two-class Gaussian with
a common covariance matrix, Bickel and Levina (2004) showed the superior performance of
Naive Bayes models over (naive implementation of) the Fisher linear discriminant rule
under broad conditions in high-dimensional settings. Fan and Fan (2008) further established
the necessity of feature selection for high-dimensional classification problems by showing
that even independence rules can be as poor as random guessing due to noise accumulation.
Featuring both independence rule and feature selection, the (sparse) Naive Bayes model
remains a good choice for classification when the sample size is fairly limited.


1.1 Asymmetrical priorities on errors 

Most existing binary classification methods target the optimization of the overall risk
and may fail to serve the purpose when users’ relative priorities over type I/II errors
differ significantly from those implied by the marginal probabilities of the two classes. A
representative example of such a scenario is the diagnosis of serious disease. Let 1 code the
healthy class and 0 code the diseased class. Given that usually

$$\mathbb{P}(Y = 1) > \mathbb{P}(Y = 0),$$

minimizing the overall risk (1.1) might yield classifiers with small overall risk $R$ (as a result
of small $R_1$) yet large $R_0$, a situation quite undesirable in practice given that flagging a healthy
case incurs only the extra cost of additional tests while failing to detect the disease endangers
a life.


The neuroblastoma dataset introduced by Oberthuer et al. (2006) provides a perfect
illustration of such intuition. The dataset contains gene expression profiles on d = 10707
genes from 246 patients in a German neuroblastoma trial, among which 56 are high-risk
(labeled as 0) and 190 are low-risk (labeled as 1). We randomly selected 41 ‘0’s and 123
‘1’s as our training sample (such that the proportion of ‘0’s is about the same as that in
the entire dataset), and tested the resulting classifiers on the remaining 15 ‘0’s and 67 ‘1’s. The
average error rates of PSN$^2$ (to be proposed; implemented here at significance level 0.05),
Gaussian Naive Bayes (nb), penalized logistic regression (pen-log), and Support Vector
Machine (svm) over 1000 random splits are summarized in Table 1. All procedures except
PSN$^2$ led to high type I errors, and are thus considered unsatisfactory given the more severe
consequences of missing a diseased instance than vice versa.


Table 1: Average error rates over 1000 random splits for neuroblastoma dataset.

Error Type          PSN$^2$   nb      pen-log   svm
type I (0 as 1)     .038      .308    .529      .603
type II (1 as 0)    .761      .150    .103      .573


One existing solution to asymmetric error control is cost-sensitive learning, which assigns
two different costs as weights of the type I/II errors (Elkan, 2001; Zadrozny et al., 2003).
Despite the many merits and practical value of this framework, limitations arise in applications
when there is no consensus over how much cost should be assigned to each class, or more
fundamentally, whether it is morally acceptable to assign costs in the first place. Also,
when users have a specific target for type I/II error control, cost-sensitive learning does
not fit. Other methods aiming for small type I error include the Asymmetric Support
Vector Machine (Wu et al., 2008), and the p-value for classification (Dümbgen et al., 2008).
However, the former has no theoretical guarantee on errors, while the latter treats all classes
as of equal importance.


1.2 Neyman-Pearson (NP) paradigm and NP oracle inequalities 

The Neyman-Pearson (NP) paradigm was introduced as a novel statistical framework for
targeted type I/II error control. Assuming type I error $R_0$ is the prioritized error type, this
paradigm seeks to control $R_0$ under a user-specified level $\alpha$ with $R_1$ as small as possible.
The oracle is thus

$$\phi^* \in \operatorname*{arg\,min}_{R_0(\phi) \le \alpha} R_1(\phi), \qquad (1.2)$$

where the significance level $\alpha$ reflects the level of conservativeness towards type I error.
Given that $\phi^*$ is unattainable in the learning paradigm, the best within our capability is to
construct a data-dependent classifier $\hat{\phi}$ that mimics it.

Despite its practical importance, NP classification has not received much attention in the
statistics and machine learning communities. Cannon et al. (2002) initiated the theoretical
treatment of NP classification. Under the same framework, Scott (2005) and Scott and
Nowak (2005) derived several results for traditional statistical learning such as PAC bounds
or oracle inequalities. By combining type I and type II errors in sensible ways, Scott (2007)
proposed a performance measure for NP classification. More recently, Blanchard et al.
(2010) developed a general solution to semi-supervised novelty detection by reducing it to
NP classification. Other related works include Casasent and Chen (2003) and Han et al.
(2008). A common issue with methods in this line of literature is that they all follow an
empirical risk minimization (ERM) approach, and use some form of relaxed empirical type
I error constraint in the optimization program. As a result, all type I errors can only be
proven to satisfy some relaxed upper bound. Take the framework set up by Cannon et al.
(2002) for example. Given $\varepsilon_0 > 0$, they proposed the program


$$\min_{\phi \in \mathcal{H},\ \hat{R}_0(\phi) \le \alpha + \varepsilon_0/2} \hat{R}_1(\phi),$$

where $\mathcal{H}$ is a set of classifiers with finite Vapnik-Chervonenkis dimension, and $\hat{R}_0$, $\hat{R}_1$ are
the empirical type I and type II errors respectively. It is shown that with high probability,
the solution $\hat{\phi}$ to the above program satisfies simultaneously: i) the type I error $R_0(\hat{\phi})$ is
bounded from above by $\alpha + \varepsilon_0$, and ii) the type II error $R_1(\hat{\phi})$ is bounded from above by
$R_1(\phi^*) + \varepsilon_1$ for some $\varepsilon_1 > 0$.


Rigollet and Tong (2011) is a significant departure from the previous NP classification
literature. This paper argues that a good classifier $\hat{\phi}$ under the NP paradigm should respect
the chosen significance level $\alpha$, rather than some relaxation of it. More precisely, two NP
oracle inequalities should be satisfied simultaneously with high probability:

(I) the type I error constraint is respected, i.e., $R_0(\hat{\phi}) \le \alpha$.

(II) the excess type II error $R_1(\hat{\phi}) - R_1(\phi^*)$ diminishes with explicit rates (w.r.t. sample
size).

Recall that, for a classifier $\hat{h}$, the classical oracle inequality insists that with high probability

the excess risk $R(\hat{h}) - R(h^*)$ diminishes with explicit rates, (1.3)

where $h^*(x) = \mathbb{I}(\eta(x) \ge 1/2)$ is the Bayes classifier, in which $\eta(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x)$
is the regression function of $Y$ on $X$ (see Koltchinskii (2008) and references
within). The two NP oracle inequalities defined above can be thought of as a generalization
of (1.3) that provides a novel characterization of classifiers’ theoretical performances under
the NP paradigm.

Using a more stringent empirical type I error constraint (than the level $\alpha$), Rigollet
and Tong (2011) established NP oracle inequalities for their proposed classifiers under convex
loss functions (as opposed to the indicator loss). They also proved an interesting negative
result: under the binary loss, ERM approaches (convexification or not) cannot guarantee
diminishing excess type II error as long as one insists that the type I error of the proposed classifier
be bounded from above by $\alpha$ with high probability. This negative result motivated a plug-in
approach to NP classification in Tong (2013).

1.3 Plug-in approaches


Plug-in methods in classical binary classification have been well studied in the literature,
where the usual plug-in target is the Bayes classifier $\mathbb{I}(\eta(x) \ge 1/2)$. Earlier works gave
rise to pessimism about the plug-in approach to classification. For example, under certain
assumptions, Yang (1999) showed plug-in estimators cannot achieve excess risk with rates
faster than $O(1/\sqrt{n})$, while direct methods can achieve rates up to $O(1/n)$ under the margin
assumption (Mammen and Tsybakov, 1999; Tsybakov, 2004; Tsybakov and van de Geer, 2005;
Tarigan and van de Geer, 2006). However, it was shown in Audibert and Tsybakov
(2007) that plug-in classifiers $\mathbb{I}(\hat{\eta}_n \ge 1/2)$ based on local polynomial estimators can achieve
rates faster than $O(1/n)$, with a smoothness condition on $\eta$ and the margin assumption.

The oracle classifier under the NP paradigm arises from its close connection to the
Neyman-Pearson Lemma in statistical hypothesis testing. Hypothesis testing bears strong
resemblance to binary classification if we assume the following model. Let $P_1$ and $P_0$ be two
known probability distributions on $\mathcal{X} \subset \mathbb{R}^d$. Assume that $Y \sim \mathrm{Bern}(\zeta)$ for some $\zeta \in (0, 1)$,
and the conditional distribution of $X$ given $Y$ is $P_Y$. Given such a model, the goal of
statistical hypothesis testing is to determine if we should reject the null hypothesis that $X$
was generated from $P_0$. To this end, we construct a randomized test $\phi : \mathcal{X} \to [0,1]$ that
rejects the null with probability $\phi(X)$. Two types of errors arise: type I error occurs when
$P_0$ is rejected yet $X \sim P_0$, and type II error occurs when $P_0$ is not rejected yet $X \sim P_1$.
The Neyman-Pearson paradigm in hypothesis testing amounts to choosing $\phi$ that solves the
following constrained optimization problem

maximize $\mathbb{E}[\phi(X) \mid Y = 1]$, subject to $\mathbb{E}[\phi(X) \mid Y = 0] \le \alpha$,

where $\alpha \in (0,1)$ is the significance level of the test. A solution to this constrained
optimization problem is called a most powerful test of level $\alpha$. The Neyman-Pearson Lemma
gives mild sufficient conditions for the existence of such a test.

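Lemma 1.1 (Neyman-Pearson Lemma). Let $P_1$ and $P_0$ be two probability measures with
densities $p$ and $q$ respectively, and denote the density ratio as $r(x) = p(x)/q(x)$. For a given
significance level $\alpha$, let $C_\alpha$ be such that $P_0\{r(X) > C_\alpha\} \le \alpha$ and $P_0\{r(X) \ge C_\alpha\} \ge \alpha$.
Then, the most powerful test of level $\alpha$ is

$$\phi^*_\alpha(X) = \begin{cases}
1 & \text{if } r(X) > C_\alpha, \\
0 & \text{if } r(X) < C_\alpha, \\
\dfrac{\alpha - P_0\{r(X) > C_\alpha\}}{P_0\{r(X) = C_\alpha\}} & \text{if } r(X) = C_\alpha.
\end{cases}$$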


Under a mild continuity assumption, we take the NP oracle

$$\phi^*(x) = \phi^*_\alpha(x) = \mathbb{I}\{p(x)/q(x) \ge C_\alpha\} = \mathbb{I}\{r(x) \ge C_\alpha\} \qquad (1.4)$$

as our plug-in target for NP classification. With kernel density estimates $\hat{p}$, $\hat{q}$, and a
proper estimate of the threshold level $\widehat{C}_\alpha$, Tong (2013) constructed a plug-in classifier
$\mathbb{I}\{\hat{p}(x)/\hat{q}(x) \ge \widehat{C}_\alpha\}$ that satisfies both NP oracle inequalities with high probability when
the dimensionality is small, leaving the high-dimensional case uncharted territory.
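
As a worked illustration of the NP oracle, consider a one-dimensional Gaussian example. This setting is an assumption made here purely for illustration and is not analyzed in the text: $P_0 = N(0,1)$ with density $q$, $P_1 = N(\mu, 1)$ with density $p$ and $\mu > 0$, $\Phi$ the standard normal CDF, and $z_{1-\alpha}$ its $(1-\alpha)$ quantile. Because the density ratio is increasing in $x$, thresholding $r$ is the same as thresholding $x$.

```latex
\begin{align*}
r(x) &= \frac{p(x)}{q(x)} = \exp\left(\mu x - \tfrac{1}{2}\mu^2\right), \\
P_0\{r(X) \ge C_\alpha\} = \alpha
  &\iff P_0\left\{X \ge \frac{\log C_\alpha + \mu^2/2}{\mu}\right\} = \alpha
  \iff C_\alpha = \exp\left(\mu z_{1-\alpha} - \tfrac{1}{2}\mu^2\right), \\
\phi^*(x) &= \mathbb{I}\{x \ge z_{1-\alpha}\}, \qquad
R_0(\phi^*) = \alpha, \qquad
R_1(\phi^*) = \Phi(z_{1-\alpha} - \mu).
\end{align*}
```

In this example the oracle type II error $\Phi(z_{1-\alpha} - \mu)$ decreases as the separation $\mu$ grows, while the type I error stays fixed at $\alpha$.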
1.4 Contribution


In the big data era, the NP classification framework faces the same curse of dimensionality as its
classical counterpart. Despite the wide potential applications of NP classification, this paper is the
first attempt to construct performance-guaranteed classifiers under the NP paradigm in high-dimensional
settings. Based on the Neyman-Pearson Lemma, we employ Naive Bayes models and
propose a computationally feasible plug-in approach to construct classifiers that satisfy the NP
oracle inequalities. We also improve the detection condition, a critical theoretical
assumption first introduced in Tong (2013), for effective threshold level estimation that grounds
the good NP properties of these classifiers. Necessity of the new detection condition is also
discussed. Note that classifiers proposed in this work are not straightforward extensions of
Tong (2013): kernel density estimation is now applied in combination with feature selection,
and the threshold level is estimated in a more precise way by order statistics that require
only moderate sample size, whereas Tong (2013) resorted to the Vapnik-Chervonenkis
theory and required sample size much bigger than what is available in most high-dimensional
applications.

The rest of the paper is organized as follows. Two screening-based plug-in NP-type
classifiers are presented in Section 2, where theoretical properties are also discussed.
Performance of the proposed classifiers is then demonstrated by both simulation studies
and real data analysis. We conclude with a short discussion. The technical
proofs are relegated to the Appendix.


2. Methods 

In this section, we first introduce several notations and definitions, with a focus on the
detection condition. Then we present the plug-in procedure, together with its theoretical
properties.


2.1 Notations and definitions 

We introduce here several notations adapted from Audibert and Tsybakov (2007). For
$\beta > 0$, denote by $\lfloor\beta\rfloor$ the largest integer strictly less than $\beta$. For any $x, x' \in \mathbb{R}$ and any
$\lfloor\beta\rfloor$ times continuously differentiable real-valued function $g(\cdot)$ on $\mathbb{R}$, we denote by $g_x$ its
Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x$. For $L > 0$, the $(\beta, L, [-1,1])$-Hölder class
of functions, denoted by $\Sigma(\beta, L, [-1,1])$, is the set of functions $g : [-1,1] \to \mathbb{R}$ that are
$\lfloor\beta\rfloor$ times continuously differentiable and satisfy, for any $x, x' \in [-1,1]$, the inequality
$|g(x') - g_x(x')| \le L|x - x'|^\beta$. The $(\beta, L, [-1,1])$-Hölder class of density is defined as

$$\mathcal{P}_\Sigma(\beta, L, [-1,1]) = \left\{ f : f \ge 0,\ \int f = 1,\ f \in \Sigma(\beta, L, [-1,1]) \right\}.$$

We will use $\beta$-valid kernels (kernels of order $\beta$, Tsybakov (2009)) for all the kernel
estimation throughout the theoretical discussion, the definition of which is as follows.


Definition 2.1. Let $K(\cdot)$ be a real-valued function on $\mathbb{R}$ with support $[-1,1]$. The function
$K(\cdot)$ is a $\beta$-valid kernel if it satisfies $\int K = 1$, $\int |K|^v < \infty$ for any $v \ge 1$, $\int |t|^\beta |K(t)|\,dt < \infty$,
and in the case $\lfloor\beta\rfloor \ge 1$, it satisfies $\int t^l K(t)\,dt = 0$ for any $l \in \mathbb{N}$ such that $1 \le l \le \lfloor\beta\rfloor$.

We assume that all the $\beta$-valid kernels considered in the theoretical part of this paper
are constructed from Legendre polynomials, and are thus Lipschitz and bounded, satisfying
the kernel conditions for the important technical Lemma A.6.


Definition 2.2 (margin assumption). A function $f(\cdot)$ is said to satisfy the margin assumption
of order $\bar{\gamma}$ with respect to probability distribution $P$ at the level $C^*$ if there exists a positive
constant $M_0$, such that for any $\delta \ge 0$,

$$P\{|f(X) - C^*| \le \delta\} \le M_0 \delta^{\bar{\gamma}}.$$


This assumption was first introduced in Polonik (1995). In the classical binary
classification framework, Mammen and Tsybakov (1999) proposed a similar condition named
“margin condition” by requiring most data to be away from the optimal decision boundary.
In the classical classification paradigm, Definition 2.2 reduces to the “margin condition” by
taking $f = \eta$ and $C^* = 1/2$, with $\{x : |f(x) - C^*| = 0\} = \{x : \eta(x) = 1/2\}$ giving the
decision boundary of the Bayes classifier. On the other hand, unlike the classical paradigm
where the optimal threshold level is known and does not need an estimate, the optimal
threshold level $C_\alpha$ in the NP paradigm is unknown and needs to be estimated, suggesting
the necessity of having sufficient data around the decision boundary to detect it well. This
concern motivated the following condition improved from Tong (2013).


Definition 2.3 (detection condition). A function $f(\cdot)$ is said to satisfy the detection condition
of order $\underline{\gamma}$ with respect to $P$ (i.e., $X \sim P$) at level $(C^*, \delta^*)$ if there exists a positive constant
$M_1$, such that for any $\delta \in (0, \delta^*)$,

$$P\{C^* \le f(X) \le C^* + \delta\} \ge M_1 \delta^{\underline{\gamma}}.$$


A detection condition works as an opposite force to the margin assumption, and is
basically an assumption on the lower bound of probability. Though we take here a power
function as the lower bound, so that it is simple and aesthetically similar to the margin
assumption, any increasing $u(\cdot)$ on $\mathbb{R}^+$ with $\lim_{x \to 0^+} u(x) = 0$ should be able to serve the
purpose. The version of the detection condition we would use to establish the NP inequalities
for the (to be) proposed classifiers takes $f = r$, $C^* = C_\alpha$, and $P = P_0$ (recall that $P_0$ is the
conditional distribution of $X$ given $Y = 0$).

Now we argue why such a condition is necessary to achieve the NP oracle inequalities.
Consider the simpler case where the density ratio $r$ is known, and we only need a proper
estimate of the threshold level $C_\alpha$. If there is nothing like the detection condition (Definition
2.3 involves a power function, but the idea is just to have any kind of lower bound), we
would have, for some $\delta > 0$,

$$P_0\{C_\alpha \le r(X) \le C_\alpha + \delta\} = 0. \qquad (2.1)$$

In getting the threshold estimate $\widehat{C}_\alpha$ of $\hat{\phi}(x) = \mathbb{I}\{r(x) \ge \widehat{C}_\alpha\}$, we cannot distinguish any
threshold level between $C_\alpha$ and $C_\alpha + \delta$. In particular, it is possible that

$$\widehat{C}_\alpha > C_\alpha + \delta/2.$$

But then the excess type II error is bounded from below as follows:

$$R_1(\hat{\phi}) - R_1(\phi^*) = P_1\{C_\alpha \le r(X) < \widehat{C}_\alpha\} \ge P_1\{C_\alpha \le r(X) < C_\alpha + \delta/2\},$$

where the last quantity can be positive. Therefore, the second NP oracle inequality
(diminishing excess type II error) does not hold for $\hat{\phi}$. Since some detection condition is necessary
in this simpler case, it is certainly necessary in our real setup.

Note that Definition 2.3 is a significant improvement of the detection condition
formulated in Tong (2013), which requires

$$P\{C^* - \delta \le f(X) \le C^*\} \wedge P\{C^* \le f(X) \le C^* + \delta\} \ge M_1 \delta^{\underline{\gamma}}.$$

We are able to drop the lower bound for the first piece due to an improved layout of the
proofs. Intuitively, our new detection condition ensures an upper bound on $\widehat{C}_\alpha$. But we do
not need an extra condition to get a lower bound of $\widehat{C}_\alpha$, because of the type I error bound
requirement (see the proof of Proposition 2.4 for details).


2.2 Neyman-Pearson plug-in procedure 

Suppose the sampling scheme is fixed as follows. 


Assumption 1. Assume the training sample contains $n$ i.i.d. observations $\mathcal{S}^1 = \{U_1, \dots, U_n\}$
from class 1 with density $p$, and $m$ i.i.d. observations $\mathcal{S}^0 = \{V_1, \dots, V_m\}$ from class 0 with
density $q$. Given fixed $n_1$, $n_2$, $m_1$, $m_2$ and $m_3$ such that $n_1 + n_2 = n$, $m_1 + m_2 + m_3 = m$,
we further decompose $\mathcal{S}^1$ and $\mathcal{S}^0$ into independent subsamples as: $\mathcal{S}^1 = \mathcal{S}^1_1 \cup \mathcal{S}^1_2$, and
$\mathcal{S}^0 = \mathcal{S}^0_1 \cup \mathcal{S}^0_2 \cup \mathcal{S}^0_3$, where $|\mathcal{S}^1_1| = n_1$, $|\mathcal{S}^1_2| = n_2$, $|\mathcal{S}^0_1| = m_1$, $|\mathcal{S}^0_2| = m_2$, $|\mathcal{S}^0_3| = m_3$.


The sample splitting idea has been considered in the literature, such as in Meinshausen
and Bühlmann (2010) and Robins et al. (2006). Given these samples, we introduce the
following plug-in procedure.


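Definition 2.4. Neyman-Pearson plug-in procedure

Step 1 Use $\mathcal{S}^1_1$, $\mathcal{S}^1_2$, $\mathcal{S}^0_1$, and $\mathcal{S}^0_2$ to construct a density ratio estimate $\hat{r}$. The specific use of
each subsample will be introduced later in this section.

Step 2 Given $\hat{r}$, choose a threshold estimate $\widehat{C}_\alpha$ from the set $\hat{r}(\mathcal{S}^0_3) = \{\hat{r}(V_{i+m_1+m_2})\}_{i=1}^{m_3}$.

Denote by $\hat{r}_{(k)}(\mathcal{S}^0_3)$ the $k$-th order statistic of $\hat{r}(\mathcal{S}^0_3)$, $k \in \{1, \dots, m_3\}$. The corresponding
plug-in classifier by setting $\widehat{C}_\alpha = \hat{r}_{(k)}(\mathcal{S}^0_3)$ is

$$\hat{\phi}_k(x) = \mathbb{I}\{\hat{r}(x) \ge \hat{r}_{(k)}(\mathcal{S}^0_3)\}. \qquad (2.2)$$

A generic procedure for choosing the optimal $k$ will be given in Section 2.3.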
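
Step 2 of the procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `np_plugin_classifier` is hypothetical, `r_hat` stands for any density ratio estimate supplied by the user, and `left_out_class0` plays the role of the held-out class 0 subsample.

```python
def np_plugin_classifier(r_hat, left_out_class0, k):
    """Build the plug-in classifier x -> 1{r_hat(x) >= k-th order statistic}.

    r_hat           : callable returning the estimated density ratio at x
    left_out_class0 : class 0 observations not used to fit r_hat
    k               : order statistic index in {1, ..., m3} (1-indexed)
    """
    scores = sorted(r_hat(v) for v in left_out_class0)
    threshold = scores[k - 1]  # k-th smallest estimated density ratio
    return lambda x: 1 if r_hat(x) >= threshold else 0


# Hypothetical one-dimensional example where the "ratio" is the identity map.
classify = np_plugin_classifier(lambda x: x, [0.1, 0.5, 0.3, 0.9, 0.7], k=4)
labels = [classify(0.8), classify(0.6)]
```

A larger `k` gives a higher threshold, so fewer points are assigned to class 1 and the type I error shrinks at the expense of the type II error.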


2.3 Threshold estimate $\widehat{C}_\alpha$

For any arbitrary density ratio estimate $\hat{r}$, we employ a proper order statistic $\hat{r}_{(k)}(\mathcal{S}^0_3)$ to
estimate the threshold $C_\alpha$, and establish a probabilistic upper bound for the type I error of
$\hat{\phi}_k$ for each $k \in \{1, \dots, m_3\}$.

Proposition 2.1. For any arbitrary density ratio estimate $\hat{r}$, let $\hat{\phi}_k(x) = \mathbb{I}\{\hat{r}(x) \ge \hat{r}_{(k)}(\mathcal{S}^0_3)\}$.
It holds for any $\delta \in (0,1)$ and $k \in \{1, \dots, m_3\}$ that

$$\mathbb{P}\{R_0(\hat{\phi}_k) > \delta\} \le \mathrm{Beta.cdf}_{k,\, m_3+1-k}(1 - \delta), \qquad (2.3)$$

where $\mathrm{Beta.cdf}_{k,\, m_3+1-k}(\cdot)$ is the CDF of $\mathrm{Beta}(k, m_3 + 1 - k)$. The inequality becomes equality
when $F_{0,\hat{r}}(t) = P_0\{\hat{r}(X) \le t\}$ is continuous almost surely.

In view of the above proposition, a sufficient condition for the classifier $\hat{\phi}_k$ to satisfy NP
Oracle Inequality (I) at tolerance level $\delta_3 \in (0,1)$ is thus

$$\mathrm{Beta.cdf}_{k,\, m_3+1-k}(1 - \alpha) \le \delta_3. \qquad (2.4)$$

Despite the potential tightness of (2.3), we are not able to derive an explicit formula for the
minimum $k$ that satisfies (2.4). To get an explicit choice for $k$, we resort to concentration
inequalities for an alternative.
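
Although (2.4) has no explicit solution, the smallest admissible $k$ can be found numerically. The sketch below is an illustration rather than part of the proposed procedure; the function names are assumptions introduced here. It relies on the standard identity that the CDF of $\mathrm{Beta}(k, m+1-k)$ at $x$ equals $\mathbb{P}\{\mathrm{Binomial}(m, x) \ge k\}$, so only the standard library is needed.

```python
from math import comb


def beta_cdf_order_stat(k, m, x):
    """CDF at x of Beta(k, m + 1 - k), computed as P(Binomial(m, x) >= k)."""
    return sum(comb(m, j) * x**j * (1.0 - x) ** (m - j) for j in range(k, m + 1))


def min_k_exact(alpha, delta3, m):
    """Smallest k in {1, ..., m} satisfying condition (2.4), or None if none exists."""
    for k in range(1, m + 1):
        # The CDF value decreases in k, so the first hit is the minimum.
        if beta_cdf_order_stat(k, m, 1.0 - alpha) <= delta3:
            return k
    return None


k_exact = min_k_exact(alpha=0.05, delta3=0.05, m=100)
```

When the held-out class 0 sample is too small, no $k$ satisfies (2.4) and the function returns `None`; for example, this happens with 10 observations at $\alpha = \delta_3 = 0.05$.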


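Proposition 2.2. For any arbitrary density ratio estimate $\hat{r}$, let $\hat{\phi}_k(x) = \mathbb{I}\{\hat{r}(x) \ge \hat{r}_{(k)}(\mathcal{S}^0_3)\}$.
It holds for any $\delta_3 \in (0,1)$ and $k \in \{1, \dots, m_3\}$ that

$$\mathbb{P}\{R_0(\hat{\phi}_k) > g(\delta_3, m_3, k)\} \le \delta_3, \qquad (2.5)$$

where

$$g(\delta_3, m_3, k) = \frac{m_3 + 1 - k}{m_3 + 1} + \sqrt{\frac{k(m_3 + 1 - k)}{\delta_3 (m_3 + 2)(m_3 + 1)^2}}. \qquad (2.6)$$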


Let $\mathcal{K} = \mathcal{K}(\alpha, \delta_3, m_3) = \{k \in \{1, \dots, m_3\} : g(\delta_3, m_3, k) \le \alpha\}$. Proposition 2.2 implies
that $k \in \mathcal{K}(\alpha, \delta_3, m_3)$ is a sufficient condition for the classifier $\hat{\phi}_k$ to satisfy NP Oracle
Inequality (I). The next step is to characterize $\mathcal{K}$ and choose some $k \in \mathcal{K}$, so that $\hat{\phi}_k$ has
small excess type II error. Clearly, we would like to find the smallest element in $\mathcal{K}$.

Proposition 2.3. The minimum $k \in \{1, \dots, m_3 + 1\}$ that satisfies $g(\delta_3, m_3, k) \le \alpha$ is

$$k_{\min}(\alpha, \delta_3, m_3) = \lceil (m_3 + 1) A_{\alpha,\delta_3}(m_3) \rceil, \qquad (2.7)$$

where $\lceil z \rceil$ denotes the smallest integer larger than or equal to $z$, and

$$A_{\alpha,\delta_3}(m_3) = \frac{1 + 2\delta_3(m_3 + 2)(1 - \alpha) + \sqrt{1 + 4\delta_3(1 - \alpha)\alpha(m_3 + 2)}}{2\{\delta_3(m_3 + 2) + 1\}}.$$

Moreover,

1. $A_{\alpha,\delta_3}(m_3) \in (1 - \alpha, 1)$.

2. $\hat{r}_{(k_{\min}(\alpha,\delta_3,m_3))}(\mathcal{S}^0_3)$ is asymptotically the empirical $(1 - \alpha)$-th quantile of $F_{0,\hat{r}}$ in the
sense that

$$\lim_{m_3 \to \infty} \frac{k_{\min}(\alpha, \delta_3, m_3)}{m_3} = \lim_{m_3 \to \infty} A_{\alpha,\delta_3}(m_3) = 1 - \alpha.$$

3. For any $m_3 \ge 4/(\alpha\delta_3)$, we have $k_{\min}(\alpha, \delta_3, m_3) \le m_3$, and thus

$$\mathcal{K}(\alpha, \delta_3, m_3) = \{k_{\min}(\alpha, \delta_3, m_3),\ k_{\min}(\alpha, \delta_3, m_3) + 1,\ \dots,\ m_3\}.$$


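Introduce shorthand notations $k_{\min} = k_{\min}(\alpha, \delta_3, m_3)$, $\hat{r}_{(k)} = \hat{r}_{(k)}(\mathcal{S}^0_3)$, and $\widehat{C}_\alpha = \hat{r}_{(\min\{k_{\min}, m_3\})}$.
We will take

$$\hat{\phi}(x) = \mathbb{I}\{\hat{r}(x) \ge \widehat{C}_\alpha\} = \begin{cases}
\mathbb{I}\{\hat{r}(x) \ge \hat{r}_{(k_{\min})}\}, & \text{if } k_{\min} \le m_3, \\
\mathbb{I}\{\hat{r}(x) \ge \hat{r}_{(m_3)}\}, & \text{if } k_{\min} = m_3 + 1
\end{cases} \qquad (2.8)$$

as the default NP plug-in classifier for any arbitrary $\hat{r}$. An alternative threshold estimate
that also guarantees the type I error bound is derived in Appendix C. Assume $m_3 \ge 4/(\alpha\delta_3)$
for the rest of the theoretical discussion. It follows from Proposition 2.3 that $k_{\min} \le m_3$,
and thus $\widehat{C}_\alpha = \hat{r}_{(k_{\min})}$ and $\hat{\phi} = \hat{\phi}_{k_{\min}}$ with guaranteed type I error control.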
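
The closed form (2.7) is easy to evaluate and to cross-check against a direct search over $g$ in (2.6). The sketch below is illustrative; the function names `g`, `k_min` and `k_min_search` are assumptions introduced here, with `m` standing for $m_3$.

```python
from math import ceil, sqrt


def g(delta3, m, k):
    """Chebyshev-type bound g(delta3, m3, k) on the type I error, as in (2.6)."""
    return (m + 1 - k) / (m + 1) + sqrt(
        k * (m + 1 - k) / (delta3 * (m + 2) * (m + 1) ** 2)
    )


def k_min(alpha, delta3, m):
    """Closed-form minimum order k_min = ceil((m3 + 1) * A), as in (2.7)."""
    c = delta3 * (m + 2)
    a = (1 + 2 * c * (1 - alpha) + sqrt(1 + 4 * c * alpha * (1 - alpha))) / (
        2 * (c + 1)
    )
    return ceil((m + 1) * a)


def k_min_search(alpha, delta3, m):
    """Brute-force check: smallest k in {1, ..., m3 + 1} with g <= alpha."""
    # k = m + 1 always qualifies because g(delta3, m, m + 1) = 0.
    return next(k for k in range(1, m + 2) if g(delta3, m, k) <= alpha)


k_closed = k_min(alpha=0.05, delta3=0.05, m=200)
k_brute = k_min_search(alpha=0.05, delta3=0.05, m=200)
```

The two routines should agree; the closed form avoids the linear scan, which matters little at these sample sizes but makes the dependence on $\alpha$, $\delta_3$ and $m_3$ explicit.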

Remark 2.1. Note that $\lim_{m_3 \to \infty} k_{\min}/\lceil m_3(1 - \alpha) \rceil = 1$. Thus, choosing the $k_{\min}$-th order
statistic of $\hat{r}(\mathcal{S}^0_3)$ as the threshold can be viewed as a modification to the classical approach
of estimating the $1 - \alpha$ quantile of $F_{0,\hat{r}}$ by the $\lceil m_3(1 - \alpha) \rceil$-th order statistic of $\hat{r}(\mathcal{S}^0_3)$. Recall
that the oracle $C_\alpha$ is actually the $1 - \alpha$ quantile of distribution $F_{0,r}$; so the intuition is that
$\widehat{C}_\alpha$ is asymptotically (when $m_3 \to \infty$) equivalent to the $1 - \alpha$ quantile of $F_{0,\hat{r}}$, which in turn
converges (when $n_1, n_2, m_1, m_2 \to \infty$) to $C_\alpha$ as the $1 - \alpha$ quantile of $F_{0,r}$ under moderate
conditions.


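Lemma 2.1. Let $\alpha, \delta_3 \in (0,1)$. In addition to Assumption 1, suppose $\hat{r}$ is such that $F_{0,\hat{r}}$ is
continuous almost surely. Then for any $\delta_4 \in (0,1)$ and $m_3 \ge 4/(\alpha\delta_3)$, the distance between
$R_0(\hat{\phi})$ ($\hat{\phi}$ as defined in (2.8)) and $R_0(\phi^*)$ can be bounded as

$$\mathbb{P}\{|R_0(\hat{\phi}) - R_0(\phi^*)| > \xi_{\alpha,\delta_3,m_3}(\delta_4)\} \le \delta_4,$$

where

$$\xi_{\alpha,\delta_3,m_3}(\delta_4) = \sqrt{\frac{k_{\min}(m_3 + 1 - k_{\min})}{(m_3 + 2)(m_3 + 1)^2 \delta_4}} + A_{\alpha,\delta_3}(m_3) - (1 - \alpha) + \frac{1}{m_3 + 1}. \qquad (2.9)$$

If $m_3 \ge \max(\delta_3^{-2}, \delta_4^{-2})$, we have $\xi_{\alpha,\delta_3,m_3}(\delta_4) \le (5/2) m_3^{-1/4}$.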


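Proposition 2.4. Let $\alpha, \delta_3, \delta_4 \in (0,1)$. In addition to the assumptions of Lemma 2.1, assume
that the density ratio $r$ satisfies the margin assumption of order $\bar{\gamma}$ at level $C_\alpha$ (with constant
$M_0$) and the detection condition of order $\underline{\gamma}$ at level $(C_\alpha, \delta^*)$ (with constant $M_1$), both with respect
to distribution $P_0$. If $m_3 \ge \max\{4/(\alpha\delta_3),\ \delta_3^{-2},\ \delta_4^{-2},\ (\tfrac{2}{5} M_1 \delta^{*\underline{\gamma}})^{-4}\}$, the excess type II error of
the classifier $\hat{\phi}$ defined in (2.8) satisfies, with probability at least $1 - \delta_3 - \delta_4$,

$$\begin{aligned}
R_1(\hat{\phi}) - R_1(\phi^*)
&\le 2M_0 \left[ \left\{ \frac{|R_0(\hat{\phi}) - R_0(\phi^*)|}{M_1} \right\}^{1/\underline{\gamma}} + 2\|\hat{r} - r\|_\infty \right]^{1+\bar{\gamma}} + C_\alpha |R_0(\hat{\phi}) - R_0(\phi^*)| \\
&\le 2M_0 \left[ \left( \tfrac{2}{5} m_3^{1/4} M_1 \right)^{-1/\underline{\gamma}} + 2\|\hat{r} - r\|_\infty \right]^{1+\bar{\gamma}} + C_\alpha \cdot \tfrac{5}{2} m_3^{-1/4}.
\end{aligned}$$

Given the above proposition, we can control the excess type II error as long as the
uniform deviation of the density ratio estimate $\|\hat{r} - r\|_\infty$ is controlled. In the following subsection,
we will introduce estimates $\hat{r}$ and provide bounds for $\|\hat{r} - r\|_\infty$.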


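2.4 Density ratio estimate $\hat{r}$

Denote the marginal densities of class 1 and 0 as $p_j$ and $q_j$ ($j = 1, \dots, d$) respectively. Naive
Bayes models for the density ratio take the form

$$r(x) = \prod_{j=1}^{d} \frac{p_j(x_j)}{q_j(x_j)},$$

where $x_j$ is the $j$-th component of $x$.


The subsamples $\mathcal{S}^1_1 = \{U_i\}_{i=1}^{n_1}$, $\mathcal{S}^1_2 = \{U_{i+n_1}\}_{i=1}^{n_2}$, $\mathcal{S}^0_1 = \{V_i\}_{i=1}^{m_1}$ and $\mathcal{S}^0_2 = \{V_{i+m_1}\}_{i=1}^{m_2}$
are used to construct (nonparametric/parametric) estimates of $p_j$ and $q_j$ for $j = 1, \dots, d$.

Nonparametric estimate of the density ratio. For marginal densities $p_j$ and $q_j$,
we apply kernel estimates $\hat{p}_j(x_j) = \{(n_1 + n_2)h_1\}^{-1} \sum_{i=1}^{n_1+n_2} K\left(\frac{U_{ij} - x_j}{h_1}\right)$ and
$\hat{q}_j(x_j) = \{(m_1 + m_2)h_0\}^{-1} \sum_{i=1}^{m_1+m_2} K\left(\frac{V_{ij} - x_j}{h_0}\right)$, where $K(\cdot)$ is the kernel function, $h_1$, $h_0$ are the
bandwidths, and $V_{ij}$ and $U_{ij}$ denote the $j$-th component of $V_i$ and $U_i$ respectively. The
resulting nonparametric estimate is

$$\hat{r}_N(x) = \prod_{j=1}^{d} \frac{\hat{p}_j(x_j)}{\hat{q}_j(x_j)}. \qquad (2.10)$$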


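Parametric estimate of the density ratio. Assume the two-class Gaussian model
$X \mid Y = 0 \sim \mathcal{N}(\mu^0, \Sigma)$ and $X \mid Y = 1 \sim \mathcal{N}(\mu^1, \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$. We estimate
$\mu^0$, $\mu^1$ and $\Sigma$ using their sample versions $\hat{\mu}^0$, $\hat{\mu}^1$ and $\hat{\Sigma}$. Under this model, the density ratio
function is given by

$$r_P(x) = \exp\left\{ (\mu^1 - \mu^0)' \Sigma^{-1} x + \tfrac{1}{2} (\mu^0)' \Sigma^{-1} \mu^0 - \tfrac{1}{2} (\mu^1)' \Sigma^{-1} \mu^1 \right\},$$

and the corresponding parametric estimate is

$$\hat{r}_P(x) = \exp\left\{ (\hat{\mu}^1 - \hat{\mu}^0)' \hat{\Sigma}^{-1} x + \tfrac{1}{2} (\hat{\mu}^0)' \hat{\Sigma}^{-1} \hat{\mu}^0 - \tfrac{1}{2} (\hat{\mu}^1)' \hat{\Sigma}^{-1} \hat{\mu}^1 \right\}. \qquad (2.11)$$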
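
The parametric estimate (2.11) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fit_gaussian_nb_ratio` is hypothetical, and the use of a pooled per-feature variance as the sample version of the diagonal covariance is an assumption, since the text does not specify the estimator. Each sample is a list of feature vectors (lists of floats).

```python
from math import exp


def fit_gaussian_nb_ratio(class1, class0):
    """Return x -> estimated density ratio under the diagonal Gaussian model."""
    d = len(class1[0])
    n, m = len(class1), len(class0)
    mu1 = [sum(u[j] for u in class1) / n for j in range(d)]
    mu0 = [sum(v[j] for v in class0) / m for j in range(d)]
    # Assumed estimator: pooled variance per feature (diagonal of the covariance).
    var = [
        (
            sum((u[j] - mu1[j]) ** 2 for u in class1)
            + sum((v[j] - mu0[j]) ** 2 for v in class0)
        )
        / (n + m - 2)
        for j in range(d)
    ]

    def r_hat(x):
        # With a diagonal covariance the exponent in (2.11) is a sum over features.
        log_r = sum(
            ((mu1[j] - mu0[j]) * x[j] + 0.5 * mu0[j] ** 2 - 0.5 * mu1[j] ** 2) / var[j]
            for j in range(d)
        )
        return exp(log_r)

    return r_hat


# Hypothetical one-feature samples with means 2 and -2 and pooled variance 2.
r_hat = fit_gaussian_nb_ratio([[1.0], [3.0]], [[-1.0], [-3.0]])
ratio_at_zero = r_hat([0.0])
```

Working on the log scale before exponentiating keeps the per-feature contributions additive; for large $d$ one may prefer to threshold the log ratio directly to avoid overflow.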


2.5 Screening-based density ratio estimate and plug-in procedures 


For “high dimension, low sample size” applications, complex models that take into account
all features usually fail; even Naive Bayes models that ignore feature dependency might
lead to poor performance due to noise accumulation (Fan and Fan, 2008). A common
solution in these scenarios is to first study marginal relations between the response and
each of the features (Fan and Lv, 2008; Li et al., 2012). By selecting the most important
individual features, we greatly reduce the model size, and other models can be applied after
this screening step. We now introduce screening-based variants of $\hat{r}_N$ and $\hat{r}_P$. Let $F_j^0$ and
$F_j^1$ denote the cdfs of $q_j$ and $p_j$ respectively, for $j = 1, \dots, d$. Step 1 of Procedure 2.4
introduced in Section 2.2 is now decomposed into a screening substep and an estimation
substep.

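Nonparametric Screening-based NP Naive Bayes (NSN$^2$) classifier


Step 1.1 Select features using $\mathcal{S}^0_1$ and $\mathcal{S}^1_1$ as follows:

$$\hat{\mathcal{A}}_\tau = \left\{ 1 \le j \le d : \|\hat{F}_j^0 - \hat{F}_j^1\|_\infty \ge \tau \right\}, \qquad (2.12)$$

where $\tau > 0$ is some threshold level, and

$$\hat{F}_j^0(x) = \frac{1}{m_1} \sum_{i=1}^{m_1} \mathbb{I}\{V_{ij} \le x\}, \qquad \hat{F}_j^1(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbb{I}\{U_{ij} \le x\} \qquad (2.13)$$

are the empirical cdfs.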


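Step 1.2 Use $\mathcal{S}^0_2$ and $\mathcal{S}^1_2$ to construct kernel estimates of $q_j$ and $p_j$ for $j \in \hat{\mathcal{A}}_\tau$. The density
ratio estimate is given by

$$\hat{r}_N^S(x) = \prod_{j \in \hat{\mathcal{A}}_\tau} \frac{\hat{p}_j(x_j)}{\hat{q}_j(x_j)}.$$

Step 2 Given $\hat{r}_N^S$, use $\mathcal{S}^0_3$ to get a threshold estimate $(\hat{r}_N^S)_{(k_{\min})}$ as in (2.8).


The resulting NSN$^2$ classifier is

$$\hat{\phi}_{\mathrm{NSN}^2}(x) = \mathbb{I}\{\hat{r}_N^S(x) \ge (\hat{r}_N^S)_{(k_{\min})}\}. \qquad (2.14)$$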
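
The screening substep (Step 1.1) keeps the features whose two empirical cdfs differ by at least the threshold in sup norm. The sketch below is illustrative only: the function names are assumptions introduced here, features are indexed from 0 rather than 1, and the quadratic-time distance computation favors clarity over speed.

```python
def ks_distance(a, b):
    """Sup-norm distance between the empirical cdfs of samples a and b."""
    # Both empirical cdfs are right-continuous step functions, so the supremum
    # of their absolute difference is attained at one of the sample points.
    grid = set(a) | set(b)
    return max(
        abs(sum(v <= x for v in a) / len(a) - sum(u <= x for u in b) / len(b))
        for x in grid
    )


def screen_features(class0, class1, tau):
    """Indices j (0-based) whose marginal cdf distance is at least tau."""
    d = len(class0[0])
    return [
        j
        for j in range(d)
        if ks_distance([v[j] for v in class0], [u[j] for u in class1]) >= tau
    ]


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
class0 = [[0.0, 0.1], [0.2, 0.5], [0.4, 0.9]]
class1 = [[1.0, 0.2], [1.2, 0.4], [1.4, 0.8]]
selected = screen_features(class0, class1, tau=0.5)
```

Kernel density estimates would then be fitted only on the selected coordinates, which is what keeps the product in the density ratio estimate short.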

Parametric Screening-based NP Naive Bayes (PSN$^2$) classifier

The PSN$^2$ procedure is similar to NSN$^2$, except for the following two differences. In Step 1.1,
features are now selected based on t-statistics ($\hat{\mathcal{A}}_\tau$ represents the index set of the selected
features). In Step 1.2, $p_j$, $q_j$ for $j \in \hat{\mathcal{A}}_\tau$ follow the two-class Gaussian model, and the resulting
parametric screening-based density ratio estimate is

$$\hat{r}_P^S(x) = \prod_{j \in \hat{\mathcal{A}}_\tau} \frac{\hat{p}_j(x_j)}{\hat{q}_j(x_j)}.$$

The corresponding PSN$^2$ classifier is thus given by

$$\hat{\phi}_{\mathrm{PSN}^2}(x) = \mathbb{I}\{\hat{r}_P^S(x) \ge (\hat{r}_P^S)_{(k_{\min})}\}. \qquad (2.15)$$


We assume the domains of all $p_j$ and $q_j$ to be $[-1,1]$ for all the following theoretical
discussion. We will prove NP oracle inequalities for $\hat{\phi}_{\mathrm{NSN}^2}$; those for $\hat{\phi}_{\mathrm{PSN}^2}$ can be
developed similarly. Recall that by Proposition 2.4 we need an upper bound for $\|\hat{r}_N^S - r\|_\infty$.
Necessarily, performance of the screening step should be studied. To this end, we assume
that only a small fraction of the $d$ features have marginal differentiating power.


Assumption 2. There exists a signal set $\mathcal A \subset \{1, \dots, d\}$ with size $|\mathcal A| = s \ll d$ such that $\inf_{j \in \mathcal A} \|F_j^0 - F_j^1\|_\infty \ge D$ for some positive constant $D$, and $F_j^0 = F_j^1$ for $j \notin \mathcal A$.

The following proposition shows that Step 1.1 achieves exact recovery ($\hat{\mathcal A}_\tau = \mathcal A$) with high probability for some properly chosen $\tau$.

Proposition 2.5 (exact recovery). Let $\delta_1 \in (0,1)$. In addition to Assumptions 1 and 2, suppose $n_1 \wedge m_1 \ge 8 D^{-2} \log(4d/\delta_1)$. Then for any $\tau \in [\Delta_0, D - \Delta_0]$, where $\Delta_0 = \sqrt{\log(4d/\delta_1)/(2n_1)} + \sqrt{\log(4d/\delta_1)/(2m_1)}$, the screening substep Step 1.1 (2.12) satisfies

\[ P(\hat{\mathcal A}_\tau = \mathcal A) \ge 1 - \delta_1. \]
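To get a feel for the sample sizes that Proposition 2.5 asks for, the following Python sketch evaluates the admissible threshold range. It assumes that the sample-size requirement is $n_1 \wedge m_1 \ge 8D^{-2}\log(4d/\delta_1)$ and that $\Delta_0$ is the sum of the two Dvoretzky-Kiefer-Wolfowitz radii used in the proof in Appendix A; the function name is hypothetical.

```python
import math

def screening_threshold_range(d, n1, m1, D, delta1):
    """Return (Delta0, D - Delta0), or None when the sample-size condition fails.

    Assumes Delta0 = sqrt(log(4d/delta1)/(2 n1)) + sqrt(log(4d/delta1)/(2 m1)).
    """
    log_term = math.log(4 * d / delta1)
    if min(n1, m1) < 8 * log_term / D ** 2:
        return None  # not enough screening samples for the guarantee
    delta0 = math.sqrt(log_term / (2 * n1)) + math.sqrt(log_term / (2 * m1))
    return delta0, D - delta0

# With d = 1000 features and signal strength D = 0.3, about 1004 screening
# samples per class are required; 2000 per class is enough, 500 is not.
print(screening_threshold_range(d=1000, n1=2000, m1=2000, D=0.3, delta1=0.05))
print(screening_threshold_range(d=1000, n1=500, m1=500, D=0.3, delta1=0.05))
```

Since $D$ is unknown in practice, the range is of theoretical interest only, which is why Section 3 resorts to a permutation-based cutoff.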

Now we are ready to control the uniform deviation of the density ratio estimate given in Step 1.2.

Assumption 3. The marginal densities $p_j, q_j \in \mathcal P_\Sigma(\beta, L, [-1,1])$ for all $j = 1, \dots, d$, and there exists $\mu > 0$ such that $p_j, q_j \ge \mu$ for all $j \in \mathcal A$. There exists some constant $C > 0$, such that $\|r\|_\infty \le C$, and there is a uniform absolute upper bound for $\|p_j^{(l)}\|_\infty$ and $\|q_j^{(l)}\|_\infty$ for $j \in \mathcal A$ and $l \in [0, \lfloor\beta\rfloor]$. Moreover, the kernel $K$ in the nonparametric density estimates is $\beta$-valid and $L'$-Lipschitz.

Smoothness conditions (Assumption 3) and the margin assumption were used together in the classical classification literature. However, it is not entirely obvious why Assumption 3 does not render the detection condition redundant. We refer interested readers to Appendix B for more detailed discussion.

Let $C_j^1$ and $C_j^0$ be the constants $C$ in Lemma A.6 when applied to $p_j$ and $q_j$ respectively. Assumption 3 ensures the existence of absolute constants $C^1 \ge \sup_{j \in \mathcal A} C_j^1$ and $C^0 \ge \sup_{j \in \mathcal A} C_j^0$.

Proposition 2.6 (uniform deviation of density ratio estimate). Under Assumptions 1-3, for any $\delta_1, \delta_2 \in (0,1)$, if $n_1 \wedge m_1 \ge 8D^{-2}\log(4d/\delta_1)$, $\sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)} \le \min(1, \mu/C^1)$, $\sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)} \le \min(1, \mu/C^0)$, and the screening threshold $\tau$ is specified as in Proposition 2.5, we have

\[ P\left( \|\hat r_N^S - r\|_\infty \le T \right) \ge 1 - \delta_1 - \delta_2, \tag{2.16} \]

where $T = B e^{B} \|r\|_\infty$ with

\[ B = s \left[ \frac{C^1 \sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)}}{\mu - C^1 \sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)}} + \frac{C^0 \sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)}}{\mu - C^0 \sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)}} \right]. \]

Moreover, assume that $n_2 \wedge m_2 \ge 1/\delta_2$, $|\mathcal A| = s \le (n_2 \wedge m_2)^{\beta/(2(\beta+1))}$, and the bandwidths $h_1 = (\log n_2/n_2)^{1/(2\beta+1)}$ and $h_0 = (\log m_2/m_2)^{1/(2\beta+1)}$, then there exists an absolute constant $C_2 > 0$ such that

\[ P\left( \|\hat r_N^S - r\|_\infty \le C_2\, s \left[ \left(\frac{\log n_2}{n_2}\right)^{\beta/(2\beta+1)} + \left(\frac{\log m_2}{m_2}\right)^{\beta/(2\beta+1)} \right] \right) \ge 1 - \delta_1 - \delta_2. \]

The upper bound on $|\mathcal A| = s$ in the above proposition ensures that the upper bound of the uniform deviation diminishes as the sample sizes $n_2$, $m_2$ go to infinity. Now we are in a position to present the main theorem for NSN$^2$.

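Theorem 2.1 (NP Oracle Inequalities for $\hat\phi_{\mathrm{NSN}^2}$). In addition to Assumptions 1-3, assume the density ratio $r$ satisfies the margin assumption of order $\bar\gamma$ at level $C_\alpha$ and detection condition of order $\underline\gamma$ at level $(C_\alpha, \delta^*)$, both with respect to $P_0$. For any given $\delta_1, \delta_2, \delta_3, \delta_4 \in (0,1)$, let the NSN$^2$ classifier $\hat\phi_{\mathrm{NSN}^2}$ be defined as in (2.14), with the screening threshold $\tau$ specified as in Proposition 2.5 and kernel bandwidths $h_1 = (\log n_2/n_2)^{1/(2\beta+1)}$ and $h_0 = (\log m_2/m_2)^{1/(2\beta+1)}$, and $\hat r_N^S$ be such that the distribution of $\hat r_N^S(X)$ under $P_0$ is continuous almost surely. For subsample sizes that satisfy $n_1 \wedge m_1 \ge 8D^{-2}\log(4d/\delta_1)$, $n_2 \wedge m_2 \ge \max\{\delta_2^{-1}, s^{2(\beta+1)/\beta}\}$, $\sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)} \le \min(1, \mu/C^1)$, $\sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)} \le \min(1, \mu/C^0)$, and $m_3 \ge \max\{4/(\alpha\delta_3), \delta_3^{-2}, \delta_4^{-2}, (\tfrac{2}{5} M_1 \delta^{*\underline\gamma})^{-4}\}$, there exists an absolute constant $C > 0$ such that with probability at least $1 - \delta_1 - \delta_2 - \delta_3 - \delta_4$,

(I) $R_0(\hat\phi_{\mathrm{NSN}^2}) \le \alpha$,

(II) $R_1(\hat\phi_{\mathrm{NSN}^2}) - R_1(\phi^*) \le C \left[ m_3^{-\left(\frac{1}{4} \wedge \frac{1+\bar\gamma}{4\underline\gamma}\right)} + s^{1+\bar\gamma} \left(\frac{\log n_2}{n_2}\right)^{\frac{\beta(1+\bar\gamma)}{2\beta+1}} + s^{1+\bar\gamma} \left(\frac{\log m_2}{m_2}\right)^{\frac{\beta(1+\bar\gamma)}{2\beta+1}} \right]$.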


Theorem 2.1 establishes the NP oracle inequalities for $\hat\phi_{\mathrm{NSN}^2}$. To help understand the conditions of this theorem, recall that Assumption 1 is about sample splitting. Assumption 2 is on minimal signal strength for the active feature set. Assumption 3 is on marginal densities and kernels in nonparametric estimates, and the margin assumption and detection condition describe the neighbourhood of the oracle decision boundary. Note that the subsample sizes $n_1$ and $m_1$ do not enter the upper bound for the excess type II error explicitly. Instead, we have size requirements on them so that the important features are kept with high probability $1 - \delta_1$ in the screening substep. The tolerance parameter $\delta_2$ arises from the nonparametric estimation of densities, the parameter $\delta_3$ is for the tolerance on violation of the type I error bound, and $\delta_4$ arises from controlling $|R_0(\hat\phi_{\mathrm{NSN}^2}) - R_0(\phi^*)|$.


3. Numerical investigation 

In this section, we analyze two simulated examples and two real datasets to demonstrate the performance of our newly proposed NSN$^2$ and PSN$^2$ classifiers, in comparison with their corresponding non-screening counterparts (denoted as NN$^2$ and PN$^2$ respectively) as well as three popular methods under the classical framework: Gaussian Naive Bayes (nb), penalized logistic regression (pen-log), and Support Vector Machine (svm). We use the R package "e1071" for nb and svm, and the R package "glmnet" for pen-log. To facilitate the presentation, we summarize the four Neyman-Pearson Naive Bayes classifiers in Table 2.

To train the classifiers in Table 2, we set $\alpha = 0.05$, $\delta_1 = 0.05$, and $\delta_3 = 0.05$ throughout this section unless specified otherwise. In Assumption 1, motivated by Proposition 2.5, we take $m_1 = \min\{10\log(4d/\delta_1), m/4\}\,\mathbb{1}(\text{screening})$, $n_1 = \min\{10\log(4d/\delta_1), n/2\}\,\mathbb{1}(\text{screening})$, $m_2 = \lfloor m/2 \rfloor - m_1$, $n_2 = n - n_1$, and $m_3 = m - \lfloor m/2 \rfloor$.
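The splitting rule above can be written as a short Python helper. This is a sketch only: the text does not say how $10\log(4d/\delta_1)$ is rounded to an integer, so rounding down is an assumption here, and the function name is hypothetical.

```python
import math

def split_sizes(m, n, d, delta1=0.05, screening=True):
    """Subsample sizes for class 0 (m1, m2, m3) and class 1 (n1, n2).

    Rounding the screening sizes down to integers is an assumption.
    """
    cap = 10 * math.log(4 * d / delta1)
    m1 = int(min(cap, m / 4)) if screening else 0
    n1 = int(min(cap, n / 2)) if screening else 0
    m2 = m // 2 - m1
    n2 = n - n1
    m3 = m - m // 2
    return {"m1": m1, "m2": m2, "m3": m3, "n1": n1, "n2": n2}

# Example: m = n = 400 training observations per class and d = 1000 features.
print(split_sizes(400, 400, 1000))
```

Class 0 is split three ways because it also supplies the hold-out sample used for the threshold, whereas class 1 is only needed for screening and density estimation.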
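Table 2: A summary of the four Neyman-Pearson Naive Bayes classifiers.

                 Screening-based                                                                          Non-screening
Nonparametric    $\hat\phi_{\mathrm{NSN}^2}(x) = \mathbb{1}\{\hat r_N^S(x) \ge (\hat r_N^S)_{(k_{\min})}\}$    $\hat\phi_{\mathrm{NN}^2}(x) = \mathbb{1}\{\hat r_N(x) \ge (\hat r_N)_{(k_{\min})}\}$
Parametric       $\hat\phi_{\mathrm{PSN}^2}(x) = \mathbb{1}\{\hat r_P^S(x) \ge (\hat r_P^S)_{(k_{\min})}\}$    $\hat\phi_{\mathrm{PN}^2}(x) = \mathbb{1}\{\hat r_P(x) \ge (\hat r_P)_{(k_{\min})}\}$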


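Due to the absence of information with respect to the true $p$ and $q$, the theoretical screening cutoff that achieves exact recovery is not feasible in practice. We resort to an empirical permutation-based approach (Fan et al., 2011) as a substitute. Specifically, the screening substep in NSN$^2$ is executed as follows:

1. Combine $S_1^0$ and $S_1^1$ into $\{(W_i, Y_i)\}_{i=1}^{m_1+n_1}$, where $W_i \in S_1^0 \cup S_1^1$ and $Y_i$ is $W_i$'s class label.

2. Calculate the marginal D-statistic for each feature:

\[ D_j = \|\hat F_j^0 - \hat F_j^1\|_\infty, \quad j = 1, 2, \dots, d, \]

where $\hat F_j^0(x) = \frac{1}{m_1}\sum_{i: Y_i = 0} \mathbb{1}(W_{ij} \le x)$ and $\hat F_j^1(x) = \frac{1}{n_1}\sum_{i: Y_i = 1} \mathbb{1}(W_{ij} \le x)$.

3. Let $\pi = \{\pi(1), \dots, \pi(m_1+n_1)\}$ be a random permutation of $\{1, \dots, (m_1+n_1)\}$. For $j = 1, \dots, d$, compute $D_j^\pi = \|\hat F_j^{0,\pi} - \hat F_j^{1,\pi}\|_\infty$, where $\hat F_j^{0,\pi}(x) = \frac{1}{m_1}\sum_{i: Y_{\pi(i)} = 0} \mathbb{1}(W_{ij} \le x)$ and $\hat F_j^{1,\pi}(x) = \frac{1}{n_1}\sum_{i: Y_{\pi(i)} = 1} \mathbb{1}(W_{ij} \le x)$.

4. For some pre-specified $Q \in [0,1]$, let $\omega(Q)$ be the $Q$-th quantile of $\{D_j^\pi : j = 1, \dots, d\}$ and select $\hat{\mathcal A} = \{j : D_j \ge \omega(Q)\}$. Here, $Q$ is a tuning parameter that keeps the percentage of noise features that pass the screening around $1 - Q$.

The same permutation idea is applied to the screening substep of PSN$^2$. $Q$ is set at 0.95 throughout this section.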
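The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: the permutation is applied by shuffling the label vector, the $Q$-th quantile is taken as the order statistic at index $\lceil Qd \rceil$ (one of several quantile conventions), and all function names are hypothetical.

```python
import bisect
import math
import random

def ks_distance(a, b):
    """Sup-norm distance between the empirical cdfs of two 1-D samples."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(sa, x) / len(sa) - bisect.bisect_right(sb, x) / len(sb))
        for x in sa + sb  # the supremum is attained at a data point
    )

def d_statistics(W, labels):
    """Marginal D-statistic of every feature for a given labelling (steps 2 and 3)."""
    stats = []
    for j in range(len(W[0])):
        col0 = [row[j] for row, y in zip(W, labels) if y == 0]
        col1 = [row[j] for row, y in zip(W, labels) if y == 1]
        stats.append(ks_distance(col0, col1))
    return stats

def permutation_screen(X0, X1, Q=0.95, seed=0):
    """Return the indices of the features whose D-statistic passes the permutation cutoff."""
    W = X0 + X1                                   # step 1: pooled sample
    labels = [0] * len(X0) + [1] * len(X1)
    D = d_statistics(W, labels)                   # step 2: observed statistics
    permuted = labels[:]
    random.Random(seed).shuffle(permuted)         # step 3: one random relabelling
    D_perm = sorted(d_statistics(W, permuted))
    cutoff = D_perm[max(math.ceil(Q * len(D_perm)) - 1, 0)]  # step 4: Q-th quantile
    return [j for j, Dj in enumerate(D) if Dj >= cutoff]

# Toy data: 20 features, of which the first 3 have a mean shift in class 1.
rng = random.Random(1)
X0 = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(150)]
X1 = [[rng.gauss(2 if j < 3 else 0, 1) for j in range(20)] for _ in range(150)]
selected = permutation_screen(X0, X1)
```

Because the cutoff is a quantile of statistics computed under a random relabelling, roughly a fraction $1 - Q$ of the noise features is expected to pass, in line with the false positive counts reported in the screening tables.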


3.1 Simulation 

Samples in both simulated examples are generated from the model

\[ p(x) = \prod_{j=1}^{d} p_j(x_j), \qquad q(x) = \prod_{j=1}^{d} q_j(x_j), \]

at 3 different dimensions: $d \in \{10, 100, 1000\}$. Sparsity for $d = 100$ and $1000$ is imposed by setting $p_j = q_j$ for all $j > 10$. Seven different training sample sizes: $m = n \in \{200, 400, 800, 1600, 3200, 6400, 12800\}$ are considered. The number of replications for each scenario is 1000. Test errors are estimated using the average of 1000 independent observations from each class for each replication.

3.1.1 Example 1: normals with different means

Assume the two-class conditional densities $p$ and $q \sim \mathcal N(0_d, I_d)$, where $I_d$ is the identity matrix. At significance level $\alpha = 0.05$, the oracle type I/II risks are $R_0(\phi_\alpha^*) = 0.05$ and $R_1(\phi_\alpha^*) = 0.53$ respectively.

We first evaluate the screening performance of PSN$^2$ and NSN$^2$ with results presented in Table 3. Both the t-statistic (in PSN$^2$) and the D-statistic (in NSN$^2$) are able to pick up most of the true signals while keeping the false positive rates at around $1 - Q$.


Table 3: Average screening performance summarized over 1000 independent replications at sample sizes m = n = 400 and Q = 0.95 with standard errors in parentheses.

        # of selected features          # of missed signals            # of false positive
d       t-stat         D-stat           t-stat        D-stat           t-stat         D-stat
10      9.11 (1.14)    8.11 (1.63)      0.89 (1.14)   1.89 (1.63)      0 (0)          0 (0)
100     14.64 (3.46)   12.43 (3.38)     0.78 (0.90)   2.00 (1.39)      5.43 (3.17)    4.43 (2.77)
1000    59.99 (9.77)   58.82 (9.87)     0.48 (0.66)   1.14 (1.05)      50.47 (9.71)   49.96 (9.78)


Figure 1: Average errors of $\hat\phi$'s over 1000 independent replications for each combination of $(d, m, n)$.

We then move on to evaluate the trend of type I and type II errors as the sample size increases in Figure 1. All the Neyman-Pearson based classifiers have type I error approaching $\alpha$ from below as sample size increases and they have similar type I errors at each sample size. However, nb, pen-log and svm all lead to a type I error larger than $\alpha$.

By enlarging the second row of Figure 1, one would observe the differences in type II errors among PN$^2$, PSN$^2$, NN$^2$, NSN$^2$. In the case of $d = 10$ when all features are signals, PN$^2$ performs the best throughout all sample sizes since it assumes the correct model without the unnecessary screening substep. When sample size is small, PSN$^2$ outperforms NN$^2$, but NN$^2$ gradually catches up on larger samples. In the case of $d = 100$, screening helps PSN$^2$ to take the lead at low sample sizes. The advantage of screening fades off as the sample size increases. In the case of $d = 1000$, PSN$^2$ dominates all other three classifiers throughout the sample size range we investigate.

Overall, the advantages of PSN$^2$ over NSN$^2$, and of PN$^2$ over NN$^2$, are uniform across all dimensions and sample sizes. This is consistent with the intuition that when the data are from a two-class Gaussian model, the parametric methods lead to more efficient estimators than their nonparametric counterparts.

3.1.2 Example 2: normal vs. mixture normal 

The normality assumption is violated in the second example. Assume $p \sim 0.5\,\mathcal N(a, \Sigma) + 0.5\,\mathcal N(-a, \Sigma)$ and $q \sim \mathcal N(0_d, I_d)$, where the last $d - 10$ entries of $a$ are zero. At significance level $\alpha = 0.05$, the oracle type I/II risks are $R_0(\phi_\alpha^*) = 0.05$ and $R_1(\phi_\alpha^*) = 0.027$ respectively.

The performance of the screening substep of PSN$^2$ and NSN$^2$ is shown in Table 4. While both screening methods keep the false positive rates at around $1 - Q$, the parametric screening method (PSN$^2$) with t-statistic misses almost all signals. This is not surprising since t-statistics rank features by differences in means and the two groups have exactly the same marginal mean and variance across all dimensions.


Table 4: Average screening performance summarized over 1000 independent replications at sample sizes m = n = 400 and Q = 0.95 with standard errors in parentheses.

        # of selected features          # of missed signals            # of false positive
d       t-stat         D-stat           t-stat        D-stat           t-stat         D-stat
10      1.76 (1.53)    8.13 (1.83)      8.24 (1.53)   1.87 (1.83)      0 (0)          0 (0)
100     5.93 (3.44)    11.96 (3.57)     9.38 (0.80)   2.34 (1.59)      5.31 (3.17)    4.29 (2.68)
1000    50.69 (9.60)   58.78 (9.87)     9.50 (0.69)   1.26 (1.04)      50.19 (9.51)   50.04 (9.62)


Figure 2 presents the average error rates. The same reason that causes the above fiasco of t-statistic screening reduces PSN$^2$ and PN$^2$ to nothing more than, if not less than, two unfair random coins with probability 0.05 of landing 1, while the behaviors of nb and pen-log bear more resemblance to that of fair random coins. This fundamental difference is due to the fact that the classical framework aims to minimize the overall risk, and therefore tends to distribute errors evenly when the sample sizes for the two classes are about the same. NSN$^2$ and NN$^2$, which are based on nonparametric assumptions, on the other hand perform very well on non-normal data. Their differences in type II error performance are similar to those in Example 1.
Figure 2: Average error rates of $\hat\phi$'s over 1000 independent replications for each combination of $(d, m, n)$. Error rates are computed as the average over 1000 independent testing data points from each class in each replication, and then averaged over replications. Panels: d = 10, d = 100, d = 1000.






From the two simulation examples, it is clear that the screening-based NSN$^2$ and PSN$^2$ exhibit advantages over their non-screening counterparts under high-dimensional settings. When the normality assumption is violated, and the sample sizes are reasonably large for efficient kernel estimates, NSN$^2$ prevails over PSN$^2$. As a rule of thumb, for high-dimensional classification problems that emphasize type I error control, we recommend NSN$^2$ if the sample size is relatively large and PSN$^2$ otherwise.

3.2 Real data analysis 

In addition to the neuroblastoma dataset analyzed in the introduction, we now demonstrate the performance of PSN$^2$ and NSN$^2$ for targeted asymmetric error control on two additional real datasets.


3.2.1 p53 mutants dataset 


The p53 mutants dataset (Danziger et al., 2006) contains d = 5407 attributes extracted from 
biophysical experiments for 16772 mutant p53 proteins, among which 143 are determined 
as “active” and the rest as “inactive” via in vivo assays. 


All 143 active samples and the first 1500 inactive samples are included in our analysis. We treat the active class as class 0 and aim to control the error of missing an active under $\alpha = 0.05$. This dataset is split into a training set with 100 observations from the active class and 1000 observations from the inactive class, and a testing set with the remaining observations. PSN$^2$ is used as the representative of our proposed methods, as the class 0 sample size is small for nonparametric methods. The average type I and type II errors over 1000 random splits are shown in Table 5. Compared with pen-log, nb and svm, PSN$^2$ performs much better in controlling the type I error.


Table 5: Average errors over 1000 random splits with standard errors in parentheses, $\alpha = 0.05$, $\delta_1 = 0.05$, $Q = 0.95$, and $\delta_3 = 0.1$.

           PSN$^2$        pen-log        nb             svm
type I     .019 (.028)    .162 (.060)    .056 (.034)    .484 (.222)
type II    .461 (.291)    .010 (.004)    .458 (.033)    .004 (.003)


3.2.2 Email spam dataset 

Now, we consider an e-mail spam dataset available at https://archive.ics.uci.edu/ml/datasets/Spambase, which contains 4601 observations with 57 features, among which 2788 are class 0 (non-spam) and 1813 are class 1 (spam). We first standardize each feature and add 5000 synthetic features consisting of independent $\mathcal N(0,1)$ variables to make the problem more challenging. The augmented data has n = 4601 observations with d = 5057 features. This augmented dataset is split into a training set with 1000 observations from each class and a testing set with the remaining observations. We use NSN$^2$ since the sample size is relatively large. The average type I and type II errors over 1000 random splits are shown in Table 6.
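The preprocessing described above (standardize each feature, then append independent standard normal noise features) might look as follows in Python. This is a sketch with assumptions: the sample standard deviation with divisor $n - 1$ is used, constant columns are left unscaled, and the function name is hypothetical.

```python
import math
import random

def standardize_and_augment(X, n_noise, seed=0):
    """Standardize every column of X and append n_noise independent N(0, 1) columns."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    # A zero standard deviation is replaced by 1.0 so constant columns do not divide by zero.
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) or 1.0
        for j in range(d)
    ]
    return [
        [(row[j] - means[j]) / sds[j] for j in range(d)]
        + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        for row in X
    ]

# Tiny example: 3 observations, 2 original features, 4 synthetic noise features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
augmented = standardize_and_augment(X, n_noise=4)
```

For the spam data this would be called with `n_noise=5000`, turning the 57 original features into 5057.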

To evaluate the flexibility of NSN$^2$ in terms of prioritized error control, we also report the performance when the priority is switched to control the type II error below $\alpha = 0.05$. The results in Table 6 demonstrate that NSN$^2$ is able to control either type I or type II error depending on the specific need of the practitioner.


Table 6: Average errors over 1000 random splits with standard errors in parentheses, $\alpha = 0.05$, $\delta_1 = 0.05$, $Q = 0.95$, and $\delta_3 = 0.05$. The suffix after NSN$^2$ indicates the type of error it targets to control under $\alpha$.

           NSN$^2$-$R_0$   NSN$^2$-$R_1$   pen-log        nb             svm
type I     .019 (.007)     .488 (.078)     .064 (.007)    .444 (.018)    .203 (.013)
type II    .439 (.057)     .020 (.009)     .133 (.015)    .054 (.008)    .235 (.017)
4. Discussion

The Neyman-Pearson classification framework is an important and interesting paradigm to explore beyond the Naive Bayes models considered in this work. For example, we can relax the independence assumption on PSN$^2$, and consider a general covariance matrix. Also, we can consider NP-type classifiers with decision boundaries involving feature interactions. It is also worthwhile to study the non-probabilistic approaches under the high-dimensional NP paradigm. Methods of potential interest include the k nearest neighbor (Weiss et al., 2010) and the centroid based classifiers (Tibshirani et al., 2002; Hall et al., 2010). However, the NP oracle inequalities are likely to be replaced by a new theoretical formulation for these methods.

A benefit of the present approach is that, for any given estimator $\hat r$, we have a uniform method to determine the proper threshold level in the plug-in classifiers. However, it would be interesting to develop new ways to estimate the threshold level $C_\alpha$ that are adaptive to the particular method used to approximate the density ratio $r$.


Appendix A. Technical Lemmas and Proofs 

Let $\mathrm{Bin.cdf}_{n,p}(\cdot)$ denote the CDF of $\mathrm{Bin}(n,p)$, and $\mathrm{Beta.cdf}_{a,b}(\cdot)$ denote the CDF of $\mathrm{Beta}(a,b)$. The following lemma proves a duality between the beta and binomial distributions.

Lemma A.1 (Beta-binomial duality). For any $p \in [0,1]$ and $k \in \{1, \dots, n\}$, it holds that

\[ 1 - \mathrm{Bin.cdf}_{n,p}(k-1) = \mathrm{Beta.cdf}_{k,n+1-k}(p). \]

Proof of Lemma A.1. Let $U_1, \dots, U_n$ be $n$ i.i.d. Uniform[0,1]. For any $p \in [0,1]$, let $N_p = \sum_{i=1}^n \mathbb{1}\{U_i \le p\}$ denote the number of $U_i$'s that are less than or equal to $p$. Given

\[ P(\mathbb{1}\{U_i \le p\} = 1) = P(U_i \le p) = p, \qquad \mathbb{1}\{U_i \le p\} \sim \mathrm{Bern}(p) \quad \forall i, \]

we have $N_p \sim \mathrm{Bin}(n,p)$, and therefore

\[ P(N_p \ge k) = 1 - P(N_p \le k-1) = 1 - \mathrm{Bin.cdf}_{n,p}(k-1). \tag{A.1} \]

On the other hand, let $U_{(k)}$ denote the $k$-th order statistic of $\{U_i\}_{i=1}^n$. It follows from the definition of order statistics that

\[ \{N_p \ge k\} = \{\text{at least } k \text{ of } U_1, \dots, U_n \text{ are less than or equal to } p\} = \{U_{(k)} \le p\}. \tag{A.2} \]

Combining (A.1) with (A.2) yields

\[ 1 - \mathrm{Bin.cdf}_{n,p}(k-1) = P(N_p \ge k) = P(U_{(k)} \le p) = \mathrm{Beta.cdf}_{k,n+1-k}(p), \]

where the last equality follows from $U_{(k)} \sim \mathrm{Beta}(k, n+1-k)$ $(k = 1, \dots, n)$ as a direct implication of Rényi's representation. This completes the proof. □
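The duality in Lemma A.1 is easy to check numerically. The Python sketch below uses only the standard library: the binomial tail is summed exactly, and the beta CDF (for integer parameters, where the density is a polynomial) is integrated with composite Simpson's rule. The function names and the integration scheme are choices made here for illustration.

```python
import math

def binom_upper_tail(n, p, k):
    """P(N >= k) for N ~ Bin(n, p), i.e. 1 - Bin.cdf_{n,p}(k - 1)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def beta_cdf_integer(a, b, x, steps=2000):
    """CDF of Beta(a, b) at x for positive integers a and b (composite Simpson's rule)."""
    const = math.factorial(a + b - 1) / (math.factorial(a - 1) * math.factorial(b - 1))
    density = lambda t: const * t ** (a - 1) * (1 - t) ** (b - 1)
    h = x / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3

n, k, p = 10, 3, 0.3
lhs = binom_upper_tail(n, p, k)             # 1 - Bin.cdf_{n,p}(k - 1)
rhs = beta_cdf_integer(k, n + 1 - k, p)     # Beta.cdf_{k, n+1-k}(p)
print(lhs, rhs)
```

The two printed values agree up to numerical integration error, as the lemma predicts.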


Lemma A.2. Let $Z$ be a random variable from CDF $F$. We have

\[ P_F\{F(Z) < \delta\} \le \delta, \qquad P_F\{F(Z) \ge \delta\} \ge 1 - \delta \qquad \forall \delta \in [0,1]. \tag{A.3} \]

For continuous $F$, the inequality becomes equality as

\[ P_F\{F(Z) < \delta\} = \delta, \qquad P_F\{F(Z) \ge \delta\} = 1 - \delta \qquad \forall \delta \in [0,1]. \tag{A.4} \]

Proof. Let $t_1 = \min\{t : F(t) \ge \delta\}$. Given the right continuity of $F$, it can be easily proved by contradiction that i) $F(t_1-) = F(t_1) = \delta$ if $F$ is continuous at $t_1$, and ii) $F(t_1-) \le \delta \le F(t_1)$ if $F$ is discontinuous at $t_1$. Thus,

\[ P_F\{F(Z) < \delta\} = P_F\{Z < t_1\} = F(t_1-) \le \delta. \]

Likewise, let $t_2 = \inf\{t : F(t) \ge \delta\}$. We have i) $F(t_2-) = F(t_2) = \delta$ if $F$ is continuous at $t_2$, and ii) $F(t_2-) \le \delta \le F(t_2)$ if $F$ is discontinuous at $t_2$. As a result,

\[ P_F\{F(Z) \ge \delta\} = P_F\{Z \ge t_2\} = 1 - P_F\{Z < t_2\} \ge 1 - \delta. \]

This completes the proof. □


Lemma A.3. Let $S = \{Z_i\}_{i=1}^n$ be a set of $n$ i.i.d. random variables from distribution $F$, and let $Z_{(k)}$ denote its $k$-th order statistic $(k = 1, \dots, n)$. For any $\delta \in (0,1)$, the probability of a new, independent realization $Z$ from $F$ to be greater than $Z_{(k)}$ satisfies

\[ P\{P_F(Z > Z_{(k)} \mid S) > \delta\} \le 1 - \mathrm{Bin.cdf}_{n,1-\delta}(k-1), \tag{A.5} \]

\[ P\{P_F(Z > Z_{(k)} \mid S) \le \delta\} \ge 1 - \mathrm{Bin.cdf}_{n,\delta}(n-k) = \mathrm{Bin.cdf}_{n,1-\delta}(k-1). \tag{A.6} \]

The inequalities become equalities if $F$ is continuous.

Proof of Lemma A.3. Rewrite the left-hand side of (A.5) as

\[ P\{P_F(Z > Z_{(k)} \mid S) > \delta\} = P\{1 - P_F(Z \le Z_{(k)} \mid S) > \delta\} = P\{1 - F(Z_{(k)}) > \delta\} = P\{F(Z_{(k)}) < 1 - \delta\}. \tag{A.7} \]

To bound the probability of $\{F(Z_{(k)}) < 1 - \delta\}$, let $N_{1-\delta} = \sum_{i=1}^n \mathbb{1}\{F(Z_i) < 1 - \delta\}$ denote the number of $F(Z_i)$'s that are less than $1 - \delta$. It follows from $F(Z_{(1)}) \le F(Z_{(2)}) \le \dots \le F(Z_{(n)})$ that

\[ \{F(Z_{(k)}) < 1 - \delta\} = \{F(Z_{(i)}) < 1 - \delta,\ i = 1, \dots, k\} = \{N_{1-\delta} \ge k\}, \qquad P\{F(Z_{(k)}) < 1 - \delta\} = P\{N_{1-\delta} \ge k\}. \tag{A.8} \]

Let $\tau = P_F\{F(Z_i) < 1 - \delta\}$ denote the success probability of $N_{1-\delta}$ as a binomial. It follows from (A.3) that $\tau \le 1 - \delta$. Given $\mathrm{Bin.cdf}_{n,p}(k-1)$ being decreasing in $p$ for any fixed $n$ and $k$, we have

\[ P\{N_{1-\delta} \ge k\} = 1 - \mathrm{Bin.cdf}_{n,\tau}(k-1) \le 1 - \mathrm{Bin.cdf}_{n,1-\delta}(k-1) \tag{A.9} \]

as a result of $\tau \le 1 - \delta$. The equalities hold for continuous $F$. Connecting (A.7), (A.8), and (A.9) together yields

\[ P\{P_F(Z > Z_{(k)} \mid S) > \delta\} = P\{F(Z_{(k)}) < 1 - \delta\} = P(N_{1-\delta} \ge k) \le 1 - \mathrm{Bin.cdf}_{n,1-\delta}(k-1). \]

Likewise, let $M_{1-\delta} = \sum_{i=1}^n \mathbb{1}\{F(Z_i) \ge 1 - \delta\}$ be a binomial random variable with size $n$ and success rate $\tau' = P_F\{F(Z_i) \ge 1 - \delta\} \ge \delta$ that represents the number of $F(Z_i)$'s that are greater than or equal to $1 - \delta$. The left-hand side of (A.6) can be rewritten as

\[
\begin{aligned}
P\{P_F(Z > Z_{(k)} \mid S) \le \delta\} &= P\{F(Z_{(k)}) \ge 1 - \delta\} \\
&= P\{F(Z_{(i)}) \ge 1 - \delta,\ i = k, \dots, n\} = P\{M_{1-\delta} \ge n + 1 - k\} \\
&= 1 - P\{M_{1-\delta} \le n - k\} = 1 - \mathrm{Bin.cdf}_{n,\tau'}(n-k) \\
&\ge 1 - \mathrm{Bin.cdf}_{n,\delta}(n-k).
\end{aligned}
\tag{A.10}
\]

This completes the proof. □

Proof of Proposition 2.1. Letting $Z_i = \hat r_i$, $n = m_3$ in Lemma A.3 yields

\[ P\{R_0(\hat\phi_k) > \delta\} \le 1 - \mathrm{Bin.cdf}_{m_3,1-\delta}(k-1). \]

This, together with Lemma A.1, completes the proof. □
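The bound in this proof can be evaluated exactly, which gives a feel for how large the order $k$ must be. The Python sketch below computes $1 - \mathrm{Bin.cdf}_{m_3,1-\alpha}(k-1)$ and searches for the smallest $k$ at which it drops below a tolerance. Note that this exact search is an illustration added here and is not the $k_{\min}$ of the paper, which is derived from the closed-form Chebyshev-type bound in Propositions 2.2 and 2.3; the function names are hypothetical.

```python
import math

def violation_bound(m3, k, alpha):
    """Upper bound 1 - Bin.cdf_{m3, 1-alpha}(k - 1) on P(R0(phi_k) > alpha)."""
    p = 1.0 - alpha
    return sum(math.comb(m3, i) * p ** i * alpha ** (m3 - i) for i in range(k, m3 + 1))

def smallest_order(m3, alpha, delta):
    """Smallest k whose violation bound is at most delta, or None if no k qualifies."""
    for k in range(1, m3 + 1):
        if violation_bound(m3, k, alpha) <= delta:
            return k
    return None

# With alpha = delta = 0.05, the largest order statistic works once m3 reaches 59,
# because 0.95 ** 59 is just below 0.05 while 0.95 ** 58 is just above it.
print(smallest_order(59, 0.05, 0.05))
print(smallest_order(58, 0.05, 0.05))
```

When no order qualifies, the hold-out sample from class 0 is simply too small to certify the type I error bound at the requested tolerance.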

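Lemma A.4. For random variable $Z \sim \mathrm{Beta}(a,b)$, and any $\varepsilon > 0$, we have

\[ P\{Z > (1+\varepsilon)\mathbb{E}Z\} \le P(|Z - \mathbb{E}Z| > \varepsilon\,\mathbb{E}Z) \le \frac{b\,\varepsilon^{-2}}{(a+b+1)a}. \]

Proof of Lemma A.4. By Chebyshev inequality,

\[ P(|Z - \mathbb{E}Z| > \varepsilon\,\mathbb{E}Z) \le \frac{\mathrm{Var}(Z)}{(\varepsilon\,\mathbb{E}Z)^2} = \frac{ab}{(a+b)^2(a+b+1)} \left(\frac{a}{a+b}\right)^{-2} \varepsilon^{-2} = \frac{b\,\varepsilon^{-2}}{(a+b+1)a}. \tag{A.11} \]

□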

Proof of Proposition 2.2. Let $B$ be a realization from $\mathrm{Beta}(k, m_3+1-k)$. It follows from Proposition 2.1 that

\[ P\{R_0(\hat\phi_k) > g(\delta_3, m_3, k)\} \le \mathrm{Beta.cdf}_{k,m_3+1-k}\{1 - g(\delta_3, m_3, k)\} = P\{B \le 1 - g(\delta_3, m_3, k)\} = P\{1 - B \ge g(\delta_3, m_3, k)\} \]

for any $k \in \{1, \dots, m_3\}$ and $\hat r$, with $1 - B \sim \mathrm{Beta}(m_3+1-k, k)$. Letting $a = m_3+1-k$, $b = k$, and $\varepsilon = k^{1/2}\{\delta_3(m_3+2)(m_3+1-k)\}^{-1/2}$ in Lemma A.4 yields

\[ P\{R_0(\hat\phi_k) > g(\delta_3, m_3, k)\} \le \delta_3, \]

where

\[ g(\delta_3, m_3, k) = (1+\varepsilon)\left(\frac{m_3+1-k}{m_3+1}\right) = \frac{m_3+1-k}{m_3+1} + \sqrt{\frac{k(m_3+1-k)}{\delta_3(m_3+2)(m_3+1)^2}}. \]

This completes the proof. □
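Proof of Proposition 2.3. By some basic algebra we have

\[ A_{\alpha,\delta_3}(m_3) - (1-\alpha) = \frac{-1 + 2\alpha + \sqrt{1 + 4\delta_3(1-\alpha)\alpha(m_3+2)}}{2\{\delta_3(m_3+2)+1\}}, \]

\[ A_{\alpha,\delta_3}(m_3) - 1 = \frac{-1 - 2\delta_3(m_3+2)\alpha + \sqrt{1 + 4\delta_3(1-\alpha)\alpha(m_3+2)}}{2\{\delta_3(m_3+2)+1\}}, \]

and

\[
\begin{aligned}
g(\delta_3, m_3, k) &= \frac{m_3+1-k}{m_3+1} + \sqrt{\frac{k(m_3+1-k)}{\delta_3(m_3+2)(m_3+1)^2}} \le \alpha \\
&\Longleftrightarrow\ k - (1-\alpha)(m_3+1) \ge 0, \\
&\qquad \{\delta_3(m_3+2)+1\}\left(\frac{k}{m_3+1}\right)^2 - \{1 + 2\delta_3(m_3+2)(1-\alpha)\}\frac{k}{m_3+1} + \delta_3(m_3+2)(1-\alpha)^2 \ge 0 \\
&\Longleftrightarrow\ k \ge (m_3+1)\max\{1-\alpha, A_{\alpha,\delta_3}(m_3)\} \\
&\Longleftrightarrow\ k \ge (m_3+1)A_{\alpha,\delta_3}(m_3).
\end{aligned}
\]

Thus,

\[ k_{\min}(\alpha, \delta_3, m_3) = \lceil (m_3+1)A_{\alpha,\delta_3}(m_3) \rceil \in [(m_3+1)A_{\alpha,\delta_3}(m_3),\ (m_3+1)A_{\alpha,\delta_3}(m_3) + 1]. \]

Since $A_{\alpha,\delta_3}(m_3) \to 1 - \alpha$ as $m_3 \to \infty$, it follows from the sandwich rule that

\[ \lim_{m_3 \to \infty} \frac{k_{\min}(\alpha, \delta_3, m_3)}{m_3} = \lim_{m_3 \to \infty} A_{\alpha,\delta_3}(m_3) = 1 - \alpha. \]

We have $k_{\min}(\alpha, \delta_3, m_3) \in \mathcal K(\alpha, \delta_3, m_3)$ ($\Longleftrightarrow k_{\min}(\alpha, \delta_3, m_3) \le m_3$) as long as

\[ (m_3+1)A_{\alpha,\delta_3}(m_3) + 1 \le m_3 \ \Longleftrightarrow\ (1 - \alpha <)\ A_{\alpha,\delta_3}(m_3) \le \frac{m_3-1}{m_3+1}. \tag{A.12} \]

For any $\lambda \in (0, \alpha)$, a sufficient condition for (A.12) is

\[ \frac{m_3-1}{m_3+1} \ge 1 - \lambda, \qquad A_{\alpha,\delta_3}(m_3) \le 1 - \lambda, \]

which can be further simplified as

\[ m_3 \ge \frac{2}{\lambda} - 1, \qquad m_3 \ge x^* - 2, \]

where

\[ x^* = \frac{-2\lambda^2 - \alpha^2 + 2\alpha\lambda + \lambda + (1-2\alpha)\lambda + \alpha^2}{2(\alpha-\lambda)^2\delta_3} = \frac{\lambda(1-\lambda)}{(\alpha-\lambda)^2\delta_3} \]

is the positive root of the quadratic equation

\[ (\alpha-\lambda)^2\delta_3^2 x^2 + \delta_3(2\lambda^2 + \alpha^2 - 2\alpha\lambda - \lambda)x - \lambda(1-\lambda) = 0. \]

Thus, a sufficient condition for (A.12) is

\[ m_3 \ge \max\left\{ \frac{\lambda(1-\lambda)}{(\alpha-\lambda)^2\delta_3} - 2,\ \frac{2}{\lambda} - 1 \right\}. \]

Setting $\lambda = \alpha/2$ yields

\[ \max\left\{ \frac{\lambda(1-\lambda)}{(\alpha-\lambda)^2\delta_3} - 2,\ \frac{2}{\lambda} - 1 \right\} = \max\left\{ \frac{2-\alpha}{\alpha\delta_3} - 2,\ \frac{4}{\alpha} - 1 \right\} \le \frac{4}{\alpha\delta_3}. \]

Therefore, $m_3 \ge 4/(\alpha\delta_3)$ guarantees (A.12) and $k_{\min}(\alpha, \delta_3, m_3) \in \mathcal K(\alpha, \delta_3, m_3)$. This completes the proof. □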
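As a numerical sanity check on the algebra in this proof, the Python sketch below compares the closed form $\lceil (m_3+1)A_{\alpha,\delta_3}(m_3) \rceil$ with a brute-force search for the smallest $k$ satisfying $g(\delta_3, m_3, k) \le \alpha$. The expression used for $A_{\alpha,\delta_3}(m_3)$ is the larger root of the quadratic inequality appearing in the proof; both the expression and the function names are reconstructions made here, not code from the paper.

```python
import math

def g(delta3, m3, k):
    """Chebyshev-type bound g(delta3, m3, k) from the proof of Proposition 2.2."""
    return (m3 + 1 - k) / (m3 + 1) + math.sqrt(
        k * (m3 + 1 - k) / (delta3 * (m3 + 2) * (m3 + 1) ** 2)
    )

def k_min_closed_form(alpha, delta3, m3):
    """ceil((m3 + 1) * A), with A the larger root of the quadratic in k / (m3 + 1)."""
    c = delta3 * (m3 + 2)
    A = (1 + 2 * c * (1 - alpha) + math.sqrt(1 + 4 * c * alpha * (1 - alpha))) / (2 * (c + 1))
    return math.ceil((m3 + 1) * A)

def k_min_brute_force(alpha, delta3, m3):
    """Smallest k in {1, ..., m3} with g(delta3, m3, k) <= alpha, or None."""
    for k in range(1, m3 + 1):
        if g(delta3, m3, k) <= alpha:
            return k
    return None

alpha, delta3, m3 = 0.05, 0.05, 2000   # m3 >= 4 / (alpha * delta3) = 1600
print(k_min_closed_form(alpha, delta3, m3), k_min_brute_force(alpha, delta3, m3))
```

Both routes return the same order for these inputs, and the ratio $k_{\min}/m_3$ is close to $1 - \alpha$, in line with the limit stated in the proof.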


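Proof of Lemma 2.1. Introduce shorthand notation: let $A = A_{\alpha,\delta_3}(m_3)$ (defined in Proposition 2.3) and $\alpha_1 = (m_3+1-k_{\min})/(m_3+1)$ for simplicity of exposition. For any $B_1, B_2 \in \mathbb{R}^+$, we have

\[ \{|R_0(\hat\phi) - \alpha| > B_1 + B_2\} \subset \{|R_0(\hat\phi) - \alpha_1| > B_1\} \cup \{|\alpha_1 - \alpha| > B_2\}, \]

and thus

\[
\begin{aligned}
P\{|R_0(\hat\phi) - \alpha| > B_1 + B_2 \mid \hat r\} &\le P\{|R_0(\hat\phi) - \alpha_1| > B_1 \mid \hat r\} + P(|\alpha_1 - \alpha| > B_2 \mid \hat r) \\
&\le \frac{k_{\min}(m_3+1-k_{\min})}{(m_3+2)(m_3+1)^2} B_1^{-2} + \mathbb{1}\{|\alpha_1 - \alpha| > B_2\},
\end{aligned}
\tag{A.13}
\]

where the last inequality follows from applying Lemma A.4 to $R_0(\hat\phi)$, which follows $\mathrm{Beta}(m_3+1-k_{\min}, k_{\min})$ for $m_3 \ge 4/(\alpha\delta_3)$ and continuous $F_{\hat r}$ due to Lemma A.3. It follows from Proposition 2.3 that

\[ |\alpha - \alpha_1| \le A - (1-\alpha) + \frac{1}{m_3+1}. \tag{A.14} \]

Letting $B_1 = \sqrt{\frac{k_{\min}(m_3+1-k_{\min})}{(m_3+2)(m_3+1)^2\delta_4}}$ and $B_2 = A - (1-\alpha) + \frac{1}{m_3+1}$ in (A.13) yields

\[ P\{|R_0(\hat\phi) - \alpha| > \xi_{\alpha,\delta_3,m_3}(\delta_4) \mid \hat r\} \le \delta_4 + \mathbb{1}\left( |\alpha_1 - \alpha| > A - (1-\alpha) + \frac{1}{m_3+1} \right) = \delta_4 \]

for any arbitrary $\hat r$. This, together with the independence between $S_3^0$ and $\hat r$ (as a function of $(S_1^0, S_1^1, S_2^0, S_2^1)$), yields

\[ P\{|R_0(\hat\phi) - \alpha| > \xi_{\alpha,\delta_3,m_3}(\delta_4)\} \le \delta_4. \]

To establish an upper bound for $\xi_{\alpha,\delta_3,m_3}(\delta_4)$, note that

\[
\begin{aligned}
\xi_{\alpha,\delta_3,m_3}(\delta_4) &= \sqrt{\frac{k_{\min}(m_3+1-k_{\min})}{(m_3+2)(m_3+1)^2\delta_4}} + \frac{-1 + 2\alpha + \sqrt{1 + 4\delta_3(1-\alpha)\alpha(m_3+2)}}{2\{\delta_3(m_3+2)+1\}} + \frac{1}{m_3+1} \\
&\le \sqrt{\frac{(m_3+1)^2/4}{(m_3+2)(m_3+1)^2\delta_4}} + \frac{1}{2\{\delta_3(m_3+2)+1\}} + \frac{\sqrt{1 + \delta_3(m_3+2)}}{2\{\delta_3(m_3+2)+1\}} + \frac{1}{m_3+1} \\
&\le \frac{1}{2\sqrt{m_3\delta_4}} + \frac{1}{2m_3\delta_3} + \frac{1}{2\sqrt{m_3\delta_3}} + \frac{1}{m_3}.
\end{aligned}
\]

When $m_3 \ge \max(\delta_3^{-2}, \delta_4^{-2})$, we have

\[ \xi_{\alpha,\delta_3,m_3}(\delta_4) \le \frac{1}{2m_3^{1/4}} + \frac{1}{2m_3^{1/2}} + \frac{1}{2m_3^{1/4}} + \frac{1}{m_3} = \frac{1}{m_3^{1/4}}\left( 1 + \frac{1}{2m_3^{1/4}} + \frac{1}{m_3^{3/4}} \right) \le \frac{5}{2} m_3^{-1/4}. \]

This completes the proof. □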

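Proof of Proposition 2.4. Let $G^* = \{r < C_\alpha\}$ and $\hat G = \{\hat r < \hat C_\alpha\}$; the excess type II error can be decomposed as:

\[
\begin{aligned}
P_1(\hat G) - P_1(G^*) &= \int_{\hat G} dP_1 - \int_{G^*} dP_1 = \int_{\hat G} \frac{p}{q}\, dP_0 - \int_{G^*} \frac{p}{q}\, dP_0 \\
&= \int_{\hat G} (r - C_\alpha)\, dP_0 + C_\alpha P_0(\hat G) - \int_{G^*} (r - C_\alpha)\, dP_0 - C_\alpha P_0(G^*) \\
&= \int_{\hat G \setminus G^*} (r - C_\alpha)\, dP_0 - \int_{G^* \setminus \hat G} (r - C_\alpha)\, dP_0 + C_\alpha\{P_0(\hat G) - P_0(G^*)\} \\
&= \int_{\hat G \setminus G^*} |r - C_\alpha|\, dP_0 + \int_{G^* \setminus \hat G} |r - C_\alpha|\, dP_0 + C_\alpha\{R_0(\phi^*) - R_0(\hat\phi)\}.
\end{aligned}
\tag{A.15}
\]

It follows from Lemma 2.1 that when $m_3 \ge \max\{4/(\alpha\delta_3), \delta_3^{-2}, \delta_4^{-2}, (\tfrac{2}{5}M_1\delta^{*\underline\gamma})^{-4}\}$,

\[ \xi_{\alpha,\delta_3,m_3}(\delta_4) \le \frac{5}{2} m_3^{-1/4} \le M_1(\delta^*)^{\underline\gamma}, \qquad \left( \frac{\xi_{\alpha,\delta_3,m_3}(\delta_4)}{M_1} \right)^{1/\underline\gamma} \le \delta^*. \]

Introduce shorthand notations $\Delta R_0 = |R_0(\phi^*) - R_0(\hat\phi)|$, $\mathcal E_0 = \{\Delta R_0 \le \xi_{\alpha,\delta_3,m_3}(\delta_4)\}$, and $T = \|\hat r - r\|_\infty$. On the event $\mathcal E_0$,

\[ \left( \frac{\Delta R_0}{M_1} \right)^{1/\underline\gamma} \le \left( \frac{\xi_{\alpha,\delta_3,m_3}(\delta_4)}{M_1} \right)^{1/\underline\gamma} \le \delta^*. \]

By the detection condition, we have

\[ \Delta R_0 \le P_0\{C_\alpha \le r(X) \le C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma}\}. \]

Note that

\[
\begin{aligned}
P_0\{r(X) \ge C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma}\} &= R_0(\phi^*) - P_0\{C_\alpha \le r(X) < C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma}\} \\
&\le R_0(\phi^*) - \Delta R_0 \\
&\le R_0(\hat\phi) = P_0\{\hat r(X) \ge \hat C_\alpha\} \\
&\le P_0\{r(X) + T \ge \hat C_\alpha\} = P_0\{r(X) \ge \hat C_\alpha - T\}.
\end{aligned}
\]

Thus, we have $\hat C_\alpha \le C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + T$, and

\[
\begin{aligned}
\hat G \setminus G^* &= \{r \ge C_\alpha,\ \hat r < \hat C_\alpha\} = \{r \ge C_\alpha,\ \hat r < C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + T\} \cap \{\hat r < \hat C_\alpha\} \\
&= \{C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + 2T \ge r \ge C_\alpha,\ \hat r < C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + T\} \cap \{\hat r < \hat C_\alpha\} \\
&\subset \{C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + 2T \ge r \ge C_\alpha\}.
\end{aligned}
\]

Therefore, the margin assumption implies

\[ P_0(\hat G \setminus G^*) \le P_0\{C_\alpha + (\Delta R_0/M_1)^{1/\underline\gamma} + 2T \ge r \ge C_\alpha\} \le M_0\{(\Delta R_0/M_1)^{1/\underline\gamma} + 2T\}^{\bar\gamma}. \]

Hence on the event $\mathcal E_0$,

\[ \int_{\hat G \setminus G^*} |r - C_\alpha|\, dP_0 \le \{(\Delta R_0/M_1)^{1/\underline\gamma} + 2T\}\, P_0(\hat G \setminus G^*) \le M_0\{(\Delta R_0/M_1)^{1/\underline\gamma} + 2T\}^{1+\bar\gamma}. \]

We will bound $\int_{G^* \setminus \hat G} |r - C_\alpha|\, dP_0$ on the event $\mathcal E_1 = \{R_0(\hat\phi) \le \alpha\}$. Note that

\[ P_0(r \ge C_\alpha) = \alpha \ge R_0(\hat\phi) = P_0(\hat r \ge \hat C_\alpha) \ge P_0(r \ge \hat C_\alpha + \|\hat r - r\|_\infty) = P_0(r \ge \hat C_\alpha + T). \]

The above chain implies that $\hat C_\alpha \ge C_\alpha - T$. Therefore,

\[
\begin{aligned}
G^* \setminus \hat G &= \{r < C_\alpha,\ \hat r \ge \hat C_\alpha\} \\
&= \{r < C_\alpha,\ r \ge r - \hat r + \hat C_\alpha\} \\
&\subset \{r < C_\alpha,\ r \ge \hat C_\alpha - T\} \\
&\subset \{C_\alpha - 2T \le r < C_\alpha\}.
\end{aligned}
\]

Hence on the event $\mathcal E_1$,

\[ \int_{G^* \setminus \hat G} |r - C_\alpha|\, dP_0 \le 2T \cdot P_0(C_\alpha - 2T \le r \le C_\alpha) \le M_0(2T)^{1+\bar\gamma}, \]

where the last inequality follows from the margin assumption. Then it follows from (A.15) that on the event $\mathcal E_0 \cap \mathcal E_1$,

\[
\begin{aligned}
R_1(\hat\phi) - R_1(\phi^*) &\le M_0\left\{ \left( \frac{\Delta R_0}{M_1} \right)^{1/\underline\gamma} + 2T \right\}^{1+\bar\gamma} + M_0(2T)^{1+\bar\gamma} + C_\alpha|R_0(\hat\phi) - R_0(\phi^*)| \\
&\le 2M_0\left\{ \left( \frac{\xi_{\alpha,\delta_3,m_3}(\delta_4)}{M_1} \right)^{1/\underline\gamma} + 2T \right\}^{1+\bar\gamma} + C_\alpha \cdot \xi_{\alpha,\delta_3,m_3}(\delta_4).
\end{aligned}
\]

From Lemma 2.1 we know that the event $\mathcal E_0$ occurs with probability at least $1 - \delta_4$. By Proposition 2.2 and Proposition 2.3 we know event $\mathcal E_1$ occurs with probability at least $1 - \delta_3$, so $\mathcal E_0 \cap \mathcal E_1$ occurs with probability at least $1 - \delta_3 - \delta_4$. This completes the proof. □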


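Proof of Proposition 2.5. Define event

\[ \mathcal E_\delta = \bigcap_{j=1}^{d} \left\{ \|\hat F_j^0 - F_j^0\|_\infty \le \delta_1^0,\ \|\hat F_j^1 - F_j^1\|_\infty \le \delta_1^1 \right\}, \]

where $\delta_1^0 = \sqrt{\log(4d/\delta_1)/(2m_1)}$ and $\delta_1^1 = \sqrt{\log(4d/\delta_1)/(2n_1)}$. On the event $\mathcal E_\delta$, for any $j \in \mathcal A$,

\[ \|\hat F_j^0 - \hat F_j^1\|_\infty \ge \|F_j^0 - F_j^1\|_\infty - \|\hat F_j^0 - F_j^0\|_\infty - \|\hat F_j^1 - F_j^1\|_\infty \ge D - \delta_1^0 - \delta_1^1. \]

For any $j \notin \mathcal A$,

\[
\begin{aligned}
\|\hat F_j^0 - \hat F_j^1\|_\infty &\le \|\hat F_j^0 - F_j^0\|_\infty + \|F_j^0 - F_j^1\|_\infty + \|F_j^1 - \hat F_j^1\|_\infty \\
&= \|\hat F_j^0 - F_j^0\|_\infty + \|F_j^1 - \hat F_j^1\|_\infty \\
&\le \delta_1^0 + \delta_1^1.
\end{aligned}
\]

Since $n_1 \ge 8D^{-2}\log(4d/\delta_1)$ and $m_1 \ge 8D^{-2}\log(4d/\delta_1)$, $\delta_1^0 + \delta_1^1 \le D - \delta_1^0 - \delta_1^1$. As a result, on the event $\mathcal E_\delta$, any $\tau \in [\delta_1^0 + \delta_1^1,\ D - \delta_1^0 - \delta_1^1]$ would lead to $\hat{\mathcal A}_\tau = \mathcal A$. Therefore,

\[
\begin{aligned}
P(\hat{\mathcal A}_\tau = \mathcal A) &\ge P(\mathcal E_\delta) \\
&\ge 1 - \sum_{j=1}^{d} \left\{ P(\|\hat F_j^1 - F_j^1\|_\infty > \delta_1^1) + P(\|\hat F_j^0 - F_j^0\|_\infty > \delta_1^0) \right\} \\
&\ge 1 - \delta_1,
\end{aligned}
\]

where the last inequality follows from applying Lemma A.5 to $\hat F_j^0$ and $\hat F_j^1$ for $j = 1, \dots, d$. This completes the proof. □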

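Proof of Proposition 2.6. Define event

\[ \mathcal E = \bigcap_{j \in \mathcal A} \left\{ \|\log \hat p_j - \log p_j\|_\infty \le B_j^1 \right\} \cap \left\{ \|\log \hat q_j - \log q_j\|_\infty \le B_j^0 \right\}, \]

where

\[ B_j^1 = \frac{C_j^1 \sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)}}{\mu - C_j^1 \sqrt{\log(2n_2 s/\delta_2)/(n_2 h_1)}}, \qquad B_j^0 = \frac{C_j^0 \sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)}}{\mu - C_j^0 \sqrt{\log(2m_2 s/\delta_2)/(m_2 h_0)}}. \]

Let $B^1 = \sup_{j \in \mathcal A} B_j^1$ and $B^0 = \sup_{j \in \mathcal A} B_j^0$; we have $B \ge s(B^1 + B^0)$. On the event $\{\hat{\mathcal A}_\tau = \mathcal A\} \cap \mathcal E$, we have

\[ \log \hat r_N^S(x) = \sum_{j \in \mathcal A} \log \frac{\hat p_j(x_j)}{\hat q_j(x_j)} = \sum_{j \in \mathcal A} \log \hat p_j(x_j) - \sum_{j \in \mathcal A} \log \hat q_j(x_j). \]

Therefore,

\[
\begin{aligned}
\|\log \hat r_N^S - \log r\|_\infty &= \Big\| \sum_{j \in \mathcal A} \log \hat p_j - \sum_{j \in \mathcal A} \log \hat q_j - \sum_{j \in \mathcal A} \log p_j + \sum_{j \in \mathcal A} \log q_j \Big\|_\infty \\
&\le \sum_{j \in \mathcal A} \left( \|\log \hat p_j - \log p_j\|_\infty + \|\log \hat q_j - \log q_j\|_\infty \right) \\
&\le \sum_{j \in \mathcal A} (B_j^1 + B_j^0) \le B.
\end{aligned}
\]

On the event $\{\hat{\mathcal A}_\tau = \mathcal A\} \cap \mathcal E$, it follows from Lagrange's mean value theorem that for any $x$, there exists some $w_x$ between $\log \hat r_N^S(x)$ and $\log r(x)$ such that

\[ |\hat r_N^S(x) - r(x)| = |e^{\log \hat r_N^S(x)} - e^{\log r(x)}| = e^{w_x} |\log \hat r_N^S(x) - \log r(x)| \le e^{\|\log r\|_\infty + B} B = B e^{B} \|r\|_\infty = T, \]

where the last inequality follows from the fact that

\[ w_x \le \max(\log r(x), \log \hat r_N^S(x)) \le \max(\|\log r\|_\infty, \|\log \hat r_N^S\|_\infty) \le \|\log r\|_\infty + B. \]

Thus, $\|\hat r_N^S - r\|_\infty \le T$, and we have

\[ P(\|\hat r_N^S - r\|_\infty \le T) \ge P(\{\hat{\mathcal A}_\tau = \mathcal A\} \cap \mathcal E) \ge P(\hat{\mathcal A}_\tau = \mathcal A) + P(\mathcal E) - 1 = P(\hat{\mathcal A}_\tau = \mathcal A) - P(\mathcal E^c). \tag{A.16} \]

By Proposition 2.5, we have

\[ P(\hat{\mathcal A}_\tau = \mathcal A) \ge 1 - \delta_1. \tag{A.17} \]

Also, it follows from Lemma A.6 that

\[ P(\|\log \hat p_j - \log p_j\|_\infty > B_j^1) \vee P(\|\log \hat q_j - \log q_j\|_\infty > B_j^0) \le \delta_2/(2s). \]

Therefore,

\[ P(\mathcal E^c) \le (2s)\,\delta_2/(2s) = \delta_2. \tag{A.18} \]

Plugging (A.17) and (A.18) back to (A.16) yields (2.16). Moreover, because $s \le n_2 \wedge m_2$, it follows from Lemma A.6 that there exists some $C_2 > 0$, such that

\[ B \le C_2\, s \left[ \left( \frac{\log n_2}{n_2} \right)^{\beta/(2\beta+1)} + \left( \frac{\log m_2}{m_2} \right)^{\beta/(2\beta+1)} \right]. \]

Moreover, by the upper bound on $s$ assumed in the proposition, the above bound implies that $B$ is bounded from above by some absolute constant. Also note that $\|r\|_\infty$ is bounded from above, so there exists an absolute constant $C_2' > 0$, such that

\[ T = B e^{B} \|r\|_\infty \le C_2'\, s \left[ \left( \frac{\log n_2}{n_2} \right)^{\beta/(2\beta+1)} + \left( \frac{\log m_2}{m_2} \right)^{\beta/(2\beta+1)} \right]. \]

This completes the proof. □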
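Proof of Theorem 2.1. Combining Propositions 2.2, 2.3, 2.4 and 2.6,

\[ P\left( R_0(\hat\phi_{\mathrm{NSN}^2}) \le \alpha,\ R_1(\hat\phi_{\mathrm{NSN}^2}) \le R_1(\phi^*) + W \right) \ge 1 - \delta_1 - \delta_2 - \delta_3 - \delta_4, \]

where

\[ W = 2M_0 \left[ \left( \frac{5}{2} m_3^{-1/4} M_1^{-1} \right)^{1/\underline\gamma} + 2C_2'\, s \left\{ \left( \frac{\log n_2}{n_2} \right)^{\beta/(2\beta+1)} + \left( \frac{\log m_2}{m_2} \right)^{\beta/(2\beta+1)} \right\} \right]^{1+\bar\gamma} + C_\alpha \cdot \frac{5}{2} m_3^{-1/4}. \]

This completes the proof. □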


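Lemma A.5 (Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al., 1956)). Let $X_1, X_2, \dots, X_n$ be real-valued i.i.d. random variables with CDF $F(\cdot)$, and let $F_n(x) = n^{-1}\sum_{i=1}^{n} \mathbb{1}(X_i \le x)$. For any $t > 0$, it holds that

\[ P(\|F_n - F\|_\infty > t) \le 2e^{-2nt^2}. \]

Or, for any given $\delta \in (0,1)$,

\[ P\left( \|F_n - F\|_\infty > \sqrt{\frac{\log(2/\delta)}{2n}} \right) \le \delta. \tag{A.19} \]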

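Lemma A.6. Given a density function $p \in \mathcal{P}_\Sigma(\beta, L, [-1,1])$, construct its kernel estimate
$\hat p(x) = (nh)^{-1} \sum_{i=1}^{n} K\big((X_i - x)/h\big)$ from an i.i.d. sample $X_1, \ldots, X_n$, where the kernel $K$ is $\beta$-valid
and $L'$-Lipschitz, and the bandwidth $h = (\log n / n)^{1/(2\beta+1)}$. For any $\delta \in (0,1)$, as long as the
sample size $n$ is such that $\sqrt{\log(n/\delta)/(nh)} < \min(1, \underline{\mu}/C)$, where
$C = \sqrt{48 c_1} + 32 c_2 + 2 L c_3 + L' + L + \tilde{C} \sum_{1 \le |s| \le \lfloor\beta\rfloor} \frac{1}{s!}$,
in which $c_1 = \|p\|_\infty \|K\|^2$, $c_2 = \|K\|_\infty + \|p\|_\infty + \int |K||t|^\beta dt$,
$c_3 = \int |K||t|^\beta dt$, and $\tilde{C}$ is such that $\tilde{C} \ge \sup_{1 \le |s| \le \lfloor\beta\rfloor} \sup_{x \in [-1,1]} |p^{(s)}(x)|$, and $\underline{\mu}\ (> 0)$ is a
lower bound of $p$, we have

\[ \mathbb{P}(\|\log \hat p - \log p\|_\infty > U) \le \delta, \tag{A.20} \]

where

\[ U = \frac{C \sqrt{\log(n/\delta)/(nh)}}{\underline{\mu} - C \sqrt{\log(n/\delta)/(nh)}}. \]

When $n \ge 1/\delta$, we have $U \le C_1 (\log n / n)^{\beta/(2\beta+1)}$ for some absolute constant $C_1$.

Proof. Let $\mathcal{E}_1 = \big\{ \|\hat p - p\|_\infty \le C \sqrt{\log(n/\delta)/(nh)} \big\}$. On the event $\mathcal{E}_1$, since $\sqrt{\log(n/\delta)/(nh)} < \min(1, \underline{\mu}/C)$,
we have

\[ \min\big(p(x_0), \hat p(x_0)\big) \ge \min\big(p(x_0), p(x_0) - \|\hat p - p\|_\infty\big) \ge \underline{\mu} - \|\hat p - p\|_\infty > 0. \]

It then follows from Lagrange's mean value theorem that for any fixed $x_0$, there exists some
$w_{x_0}$ between $p(x_0)$ and $\hat p(x_0)$ such that

\[ |\log \hat p(x_0) - \log p(x_0)| = w_{x_0}^{-1} |\hat p(x_0) - p(x_0)| \le \big[\min\{p(x_0), \hat p(x_0)\}\big]^{-1} |\hat p(x_0) - p(x_0)| \le \frac{\|\hat p - p\|_\infty}{\underline{\mu} - \|\hat p - p\|_\infty}. \]

As a result, it holds on the event $\mathcal{E}_1$ that

\[ \|\log \hat p - \log p\|_\infty \le \frac{C \sqrt{\log(n/\delta)/(nh)}}{\underline{\mu} - C \sqrt{\log(n/\delta)/(nh)}} = U, \]

and

\[ \mathbb{P}(\|\log \hat p - \log p\|_\infty \le U) \ge \mathbb{P}\Big(\|\hat p - p\|_\infty \le C \sqrt{\log(n/\delta)/(nh)}\Big) \ge 1 - \delta, \]

where the last inequality follows from Lemma A.1 in Tong (2013) (the special case of
$d = 1$). Finally, when $n \ge 1/\delta$, we have

\[ U = \frac{C \sqrt{\log(n/\delta)/(nh)}}{\underline{\mu} - C \sqrt{\log(n/\delta)/(nh)}} \le C_1 (\log n / n)^{\beta/(2\beta+1)} \]

for some absolute constant $C_1$. This completes the proof. □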
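To make the objects in Lemma A.6 concrete, here is a minimal sketch of the kernel estimate with bandwidth $h = (\log n/n)^{1/(2\beta+1)}$ and of the sup-norm error of its logarithm. Everything specific in it is an assumption made for illustration: the Epanechnikov kernel, $\beta = 1$, a Uniform$[-1,1]$ sample (true density $1/2$), the seed, and the restriction of the evaluation grid to interior points, which sidesteps the boundary behavior of a plain kernel estimate.

```python
import math
import random

def bandwidth(n, beta):
    """h = (log n / n)^(1 / (2 beta + 1)), the bandwidth in Lemma A.6."""
    return (math.log(n) / n) ** (1.0 / (2.0 * beta + 1.0))

def epanechnikov(u):
    """A bounded, Lipschitz, symmetric kernel (an illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, sample, h, kernel=epanechnikov):
    """Kernel estimate (nh)^{-1} * sum_i K((X_i - x) / h)."""
    return sum(kernel((xi - x) / h) for xi in sample) / (len(sample) * h)

def sup_log_error(sample, density, grid, h):
    """Maximum over the grid of |log p_hat(x) - log p(x)|."""
    return max(abs(math.log(kde(x, sample, h)) - math.log(density(x))) for x in grid)

rng = random.Random(1)                                  # fixed seed, illustrative
n = 5000
sample = [rng.uniform(-1.0, 1.0) for _ in range(n)]     # true density p = 1/2
h = bandwidth(n, beta=1.0)
grid = [-0.8 + 0.1 * i for i in range(17)]              # interior points only
print(h, sup_log_error(sample, lambda x: 0.5, grid, h))
```

The lemma bounds the error of $\log \hat p$ through the error of $\hat p$ and the lower bound $\underline{\mu}$; in this toy setting $\underline{\mu} = 1/2$, so a small sup-norm error in $\hat p$ translates directly into a small error on the log scale.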


Appendix B. About the detection condition and Assumption 1

We show that it is possible for densities satisfying Assumption 1 to violate a generalized
version of the detection condition defined in Definition 2.3. While the generalized detection
condition applies to general $(P, f, C^*)$ as the original one does, we narrow its definition to
$(P_0, r, C_\alpha)$, which is what we actually use in the main text.

Definition B.1 (Generalized detection condition). Let $u(\cdot)$ be a strictly increasing differentiable
function on $\mathbb{R}^+$ with $\lim_{x \to 0^+} u(x) = 0$. A function $r(\cdot)$ is said to satisfy the generalized
detection condition with respect to $P_0$ and $u(\cdot)$ at level $(C_\alpha, \delta^*)$ if for any $\delta \in (0, \delta^*)$,

\[ P_0\{C_\alpha \le r(X) \le C_\alpha + \delta\} \ge u(\delta). \tag{B.1} \]


The following conditions suffice to make (B.1) fail:

\[ P_0\{C_\alpha \le r(X) \le C_\alpha + k^{-1}\} < u(k^{-1}), \quad k = 1, 2, \ldots. \tag{B.2} \]


A 1-dimensional toy example that satisfies Assumption 1 and (B.2) (thus violating the
generalized detection condition) is given as follows. Assume $P_0$ and $P_1$ have the same
support $[-1,1]$. Given $u(\cdot)$ as a strictly increasing differentiable function on $\mathbb{R}^+$ with
$\lim_{x \to 0^+} u(x) = 0$, let $q(x) = \alpha$ for all $x \in [0,1]$, and set $p(x)$ accordingly such that

\[ r(x) = \frac{p(x)}{q(x)} = \begin{cases} 2u^{-1}(1) + 2u^{-1}(\alpha x), & x \in (0,1], \\ 2u^{-1}(1), & x = 0, \\ 2u^{-1}(1) - v(x), & x \in [-1,0), \end{cases} \tag{B.3} \]

where $v(\cdot)$ is some positive differentiable function that makes $r(\cdot)$ differentiable at $x = 0$.
It follows from (B.3) that $\{x \in [-1,1] : r(x) \ge 2u^{-1}(1)\} = [0,1]$, and the identity

\[ P_0\{r(X) \ge 2u^{-1}(1)\} = \int_{\{x \in [-1,1] :\, r(x) \ge 2u^{-1}(1)\}} q(x)\,dx = \int_{[0,1]} q(x)\,dx = \alpha \]

implies $C_\alpha = 2u^{-1}(1)$. As a result, for any $k \in \{1, 2, \ldots\}$ we have

\[ \{C_\alpha \le r(X) \le C_\alpha + k^{-1}\} = \{X \in [0,1],\ 2u^{-1}(\alpha X) \le k^{-1}\} = \{X \in [0, \alpha^{-1} u(0.5 k^{-1})]\}, \]

and

\[ P_0\{C_\alpha \le r(X) \le C_\alpha + k^{-1}\} = P_0\{X \in [0, \alpha^{-1} u(0.5 k^{-1})]\} = \int_0^{\alpha^{-1} u(0.5 k^{-1})} q(x)\,dx = \alpha \cdot \alpha^{-1} u(0.5 k^{-1}) = u(0.5 k^{-1}) < u(k^{-1}) \]

satisfies (B.2). Note that the above construction makes no assumption about the behavior
of $q(\cdot)$ and $p(\cdot)$ on $[-1,0)$ except the normalization constraints $\int_{-1}^{1} p\,dx = \int_{-1}^{1} q\,dx = 1$
and $r(\cdot)$ being differentiable on $[-1,1]$. Thus, there exist $p$, $q$, and $r$ that satisfy Assumption 1
while violating the generalized detection condition.
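The computation in this toy example can be reproduced numerically. The sketch below is illustrative only: the choices $u(x) = x^2$ and $\alpha = 0.5$ are assumptions made here (not taken from the paper) so that $\alpha^{-1} u(0.5 k^{-1}) \le 1$ for every $k \ge 1$, and `interval_mass` is a hypothetical helper that approximates the $P_0$-mass of the event in (B.2) by a midpoint Riemann sum.

```python
import math

def interval_mass(alpha, k, u_inv, n_grid=20000):
    """Riemann approximation of P_0{C_alpha <= r(X) <= C_alpha + 1/k} under (B.3).

    Only x in [0, 1] can fall in the event (r < C_alpha on [-1, 0)), and
    q(x) = alpha there, so the mass is alpha times the length of the set.
    """
    c_alpha = 2.0 * u_inv(1.0)
    inside = 0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid                    # midpoint of the i-th cell
        r = c_alpha + 2.0 * u_inv(alpha * x)      # r(x) on (0, 1]
        if c_alpha <= r <= c_alpha + 1.0 / k:
            inside += 1
    return alpha * inside / n_grid

# Illustrative choice (an assumption): u(x) = x^2, so u^{-1} = sqrt; alpha = 0.5.
u = lambda x: x * x
u_inv = math.sqrt
alpha = 0.5
for k in range(1, 6):
    # Columns: k, numerical mass, u(0.5/k), u(1/k)
    print(k, interval_mass(alpha, k, u_inv), u(0.5 / k), u(1.0 / k))
```

For each $k$ the numerical mass should agree with $u(0.5 k^{-1})$ up to discretization error and stay strictly below $u(k^{-1})$, which is exactly condition (B.2).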


Appendix C. An alternative threshold estimate 

This part presents an alternative estimate of the threshold $C_\alpha$ that guarantees the type I error
bound. Based on the Chernoff inequality, Proposition C.1 below gives an alternative version
of Proposition 2.2. First, we introduce two technical lemmas.

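Lemma C.1. If $G_k \sim \mathrm{Gamma}(k, 1)$, $k > 0$, then for any $\tau \in (0, k)$, we have

\[ \mathbb{P}(G_k > k + \tau) \le e^{-\tau^2/(4k)}, \qquad \mathbb{P}(G_k < k - \tau) \le e^{-\tau^2/(2k)} \le e^{-\tau^2/(4k)}. \]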
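As a sanity check on the tail bound of the form $e^{-\tau^2/(4k)}$ that is applied later in (C.4), the sketch below compares it with exact Gamma tail probabilities. It is an illustration under stated assumptions: it only covers integer shapes $k$, for which the Gamma survival function has a closed form through the Poisson distribution, and the grid of $\tau$ values and the range of $k$ are arbitrary choices.

```python
import math

def gamma_sf(k, x):
    """P(G_k > x) for G_k ~ Gamma(k, 1) with integer k, using the identity
    P(G_k > x) = P(Poisson(x) <= k - 1)."""
    if x <= 0.0:
        return 1.0
    return sum(math.exp(-x + j * math.log(x) - math.lgamma(j + 1)) for j in range(k))

def chernoff_bound(k, tau):
    """The bound exp(-tau^2 / (4k))."""
    return math.exp(-tau * tau / (4.0 * k))

def check(k, steps=50):
    """True if both tails stay below the bound on a grid of tau in (0, k)."""
    ok = True
    for i in range(1, steps):
        tau = k * i / steps
        upper = gamma_sf(k, k + tau)            # P(G_k > k + tau)
        lower = 1.0 - gamma_sf(k, k - tau)      # P(G_k < k - tau), continuous law
        bound = chernoff_bound(k, tau)
        ok = ok and upper <= bound and lower <= bound
    return ok

print(all(check(k) for k in range(1, 31)))
```

The check is deliberately one-sided: it confirms that the exact tails never exceed the bound on the grid, and says nothing about how tight the bound is.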


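Proof of Lemma C.1. For any $\epsilon \in (0,1)$ and $t \in (0,1)$, it follows from the Chernoff inequality
that

\[ \mathbb{P}\{G_k > (1+\epsilon)k\} = \mathbb{P}\{e^{tG_k} > e^{t(1+\epsilon)k}\} \le e^{-t(1+\epsilon)k}\, \mathbb{E}\, e^{tG_k} = (1-t)^{-k} e^{-t(1+\epsilon)k}. \tag{C.1} \]

Letting $t = \arg\min_{t \in (0,1)} (1-t)^{-k} e^{-t(1+\epsilon)k} = \epsilon/(1+\epsilon)$ in (C.1) yields

\[ \mathbb{P}\{G_k > (1+\epsilon)k\} \le (1+\epsilon)^k e^{-\epsilon k} = e^{-k[\epsilon - \log(1+\epsilon)]} \le e^{-k\epsilon^2/4}, \]

where the last inequality uses $\log(1+\epsilon) \le \epsilon - \epsilon^2/4$ for $0 < \epsilon < 1$.
Likewise, for any $\epsilon \in (0,1)$ and $s < 0$,

\[ \mathbb{P}\{G_k < (1-\epsilon)k\} = \mathbb{P}\{e^{sG_k} > e^{s(1-\epsilon)k}\} \le e^{-s(1-\epsilon)k}\, \mathbb{E}\, e^{sG_k} = (1-s)^{-k} e^{-s(1-\epsilon)k}. \tag{C.2} \]

Letting $s = \arg\min_{s < 0} (1-s)^{-k} e^{-s(1-\epsilon)k} = -\epsilon/(1-\epsilon)$ in (C.2) yields

\[ \mathbb{P}\{G_k < (1-\epsilon)k\} \le (1-\epsilon)^k e^{\epsilon k} \le e^{-k\epsilon^2/2}, \]

where the last inequality follows from the Taylor expansion

\[ \log(1-\epsilon) + \epsilon = -\sum_{i=1}^{\infty} \frac{\epsilon^i}{i} + \epsilon = -\sum_{i=2}^{\infty} \frac{\epsilon^i}{i} \le -\frac{\epsilon^2}{2}, \quad \forall\, 0 < \epsilon < 1. \]

Taking $\epsilon = \tau/k$, the conclusion of the lemma follows. □

Lemma C.2. Let $B \sim \mathrm{Beta}(a, b)$, and $\mu = \mathbb{E}(B) = a/(a+b)$. For any $t \in (0, 1-\mu)$,

\[ \mathbb{P}\{B > \mu + t\} \le 2\exp\left\{ -4^{-1} \left( \frac{(a+b)t}{\sqrt{b}(\mu+t) + \sqrt{a}(1-\mu-t)} \right)^2 \right\}. \]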


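Proof of Lemma C.2. By properties of the beta distribution, we can represent $B$ as

\[ B = \frac{G_a}{G_a + G_b}, \quad \text{where } G_a \sim \Gamma(a, 1),\ G_b \sim \Gamma(b, 1) \text{ are independent.} \]

For any $t > 0$ and constant $C$ such that $a(1-\mu-t) < C < b(\mu+t)$, we have

\[ \begin{aligned} \mathbb{P}(B \le \mu + t) &= \mathbb{P}\{(1-\mu-t)G_a \le (\mu+t)G_b\} \ge \mathbb{P}\{(1-\mu-t)G_a \le C \le (\mu+t)G_b\} \\ &= \mathbb{P}\{(1-\mu-t)G_a \le C\}\, \mathbb{P}\{C \le (\mu+t)G_b\} \\ &= \mathbb{P}\left(G_a \le \frac{C}{1-\mu-t}\right) \mathbb{P}\left(G_b \ge \frac{C}{\mu+t}\right) \\ &= \left\{1 - \mathbb{P}\left(G_a > \frac{C}{1-\mu-t}\right)\right\} \left\{1 - \mathbb{P}\left(G_b < \frac{C}{\mu+t}\right)\right\} \\ &\ge 1 - \mathbb{P}\left(G_a > \frac{C}{1-\mu-t}\right) - \mathbb{P}\left(G_b < \frac{C}{\mu+t}\right), \end{aligned} \tag{C.3} \]

where by Lemma C.1,

\[ \begin{aligned} \mathbb{P}\left(G_a > \frac{C}{1-\mu-t}\right) &= \mathbb{P}\left(G_a > a + \left[\frac{C}{1-\mu-t} - a\right]\right) \le e^{-\left(\frac{C}{1-\mu-t} - a\right)^2 (4a)^{-1}}, \\ \mathbb{P}\left(G_b < \frac{C}{\mu+t}\right) &= \mathbb{P}\left(G_b < b - \left[b - \frac{C}{\mu+t}\right]\right) \le e^{-\left(b - \frac{C}{\mu+t}\right)^2 (4b)^{-1}}. \end{aligned} \tag{C.4} \]

Letting

\[ C = \frac{(1-\mu-t)(\mu+t)(a\sqrt{b} + b\sqrt{a})}{\sqrt{b}(\mu+t) + \sqrt{a}(1-\mu-t)} \]

in (C.3), such that the two exponents in (C.4) are equal,

\[ \left(\frac{C}{1-\mu-t} - a\right)^2 (4a)^{-1} = \left(b - \frac{C}{\mu+t}\right)^2 (4b)^{-1} = 4^{-1} \left( \frac{(a+b)t}{\sqrt{b}(\mu+t) + \sqrt{a}(1-\mu-t)} \right)^2, \]

yields

\[ \mathbb{P}(B > \mu + t) = 1 - \mathbb{P}(B \le \mu + t) \le \mathbb{P}\left(G_a > \frac{C}{1-\mu-t}\right) + \mathbb{P}\left(G_b < \frac{C}{\mu+t}\right) \le 2\exp\left\{ -4^{-1} \left( \frac{(a+b)t}{\sqrt{b}(\mu+t) + \sqrt{a}(1-\mu-t)} \right)^2 \right\}. \]

This completes the proof. □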
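The bound of Lemma C.2 can be compared with exact Beta tail probabilities. The sketch below is a hedged illustration: it handles integer $a$ and $b$ only (where the Beta CDF reduces to a binomial sum), and it restricts attention to parameter values for which the deviation required of $G_a$ in (C.4), namely $C/(1-\mu-t) - a$, is below $a$, the range of $\tau$ covered by Lemma C.1. That restriction, the grid, and all function names are assumptions of this sketch rather than part of the original argument.

```python
import math

def beta_sf(a, b, x):
    """P(B > x) for B ~ Beta(a, b) with integer a, b, using the identity
    P(B <= x) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a))

def _ratio(a, b, t):
    """(a + b) t / (sqrt(b)(mu + t) + sqrt(a)(1 - mu - t))."""
    mu = a / (a + b)
    return (a + b) * t / (math.sqrt(b) * (mu + t) + math.sqrt(a) * (1.0 - mu - t))

def lemma_c2_bound(a, b, t):
    """Right-hand side of Lemma C.2."""
    return 2.0 * math.exp(-0.25 * _ratio(a, b, t) ** 2)

def gamma_deviation_ok(a, b, t):
    """True when the deviation C/(1 - mu - t) - a = sqrt(a) * ratio is below a."""
    return math.sqrt(a) * _ratio(a, b, t) < a

def check(max_shape=15, steps=40):
    """Compare exact tails with the bound over a grid of (a, b, t)."""
    for a in range(1, max_shape + 1):
        for b in range(1, max_shape + 1):
            mu = a / (a + b)
            for i in range(1, steps):
                t = (1.0 - mu) * i / steps
                if gamma_deviation_ok(a, b, t):
                    if beta_sf(a, b, mu + t) > lemma_c2_bound(a, b, t) + 1e-12:
                        return False
    return True

print(check())
```

The identity $C/(1-\mu-t) - a = \sqrt{a}\,(a+b)t / \{\sqrt{b}(\mu+t) + \sqrt{a}(1-\mu-t)\}$ used in `gamma_deviation_ok` follows from the choice of $C$ in the proof together with $b(\mu+t) - a(1-\mu-t) = (a+b)t$.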


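Proposition C.1. Let $\hat r(\cdot)$ be any estimate of the density ratio function. For any $\delta_3 \in (0,1)$
and $k \in \{1, \cdots, m_3\}$, the type I error of the classifier $\hat\phi_k$ defined in (2.1) satisfies

\[ \mathbb{P}\big\{ R_0(\hat\phi_k) > h(\delta_3, m_3, k) \big\} \le \delta_3, \]

where

\[ h(\delta_3, m_3, k) = \frac{m_3 + 1 - k + 2\sqrt{\log(2/\delta_3)}\,\sqrt{m_3 - k + 1}}{m_3 + 1 + 2\sqrt{\log(2/\delta_3)}\,\big(\sqrt{m_3 - k + 1} - \sqrt{k}\big)}. \]

Proof. Let $B$ be a random variable following $\mathrm{Beta}(k, m_3 + 1 - k)$. It follows from Proposition 2.1 that

\[ \mathbb{P}\{R_0(\hat\phi_k) > h(\delta_3, m_3, k)\} \le \mathrm{Beta.cdf}_{k,\, m_3+1-k}\big(1 - h(\delta_3, m_3, k)\big) = \mathbb{P}\{B \le 1 - h(\delta_3, m_3, k)\} = \mathbb{P}\{1 - B \ge h(\delta_3, m_3, k)\} \]

for any $k \in \{1, \cdots, m_3\}$ and $\hat r$, with $1 - B \sim \mathrm{Beta}(m_3 + 1 - k, k)$. Letting $a = m_3 + 1 - k$,
$b = k$, and

\[ t = \frac{2\sqrt{\log(2/\delta_3)}\,\big\{(m_3 + 1 - k)\sqrt{k} + k\sqrt{m_3 + 1 - k}\big\}}{(m_3 + 1)\big\{m_3 + 1 + 2\sqrt{\log(2/\delta_3)}\,\big(\sqrt{m_3 + 1 - k} - \sqrt{k}\big)\big\}} \]

in Lemma C.2 yields

\[ \mathbb{P}\big\{ R_0(\hat\phi_k) > h(\delta_3, m_3, k) \big\} \le \delta_3. \]

This completes the proof. □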
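The function $h(\delta_3, m_3, k)$ is straightforward to evaluate, and scanning $k$ gives the smallest order for which the bound falls below a target level $\alpha$. The following sketch assumes nothing beyond the formula in Proposition C.1; the names `h` and `k_chern`, and the guard against a non-positive denominator, are illustrative additions and not part of the paper.

```python
import math

def h(delta3, m3, k):
    """Chernoff-type bound h(delta_3, m_3, k) from Proposition C.1."""
    c = 2.0 * math.sqrt(math.log(2.0 / delta3))
    a = m3 + 1 - k
    num = a + c * math.sqrt(a)
    den = m3 + 1 + c * (math.sqrt(a) - math.sqrt(k))
    # Defensive guard (an assumption of this sketch): treat a non-positive
    # denominator as "no usable bound" instead of returning a negative value.
    return num / den if den > 0 else float("inf")

def k_chern(alpha, delta3, m3):
    """Smallest k in {1, ..., m3} with h(delta3, m3, k) <= alpha, or None."""
    for k in range(1, m3 + 1):
        if h(delta3, m3, k) <= alpha:
            return k
    return None

# Illustrative values of (alpha, delta_3, m_3).
print(k_chern(alpha=0.05, delta3=0.05, m3=1000))
print(k_chern(alpha=0.05, delta3=0.05, m3=10))
```

When no $k$ satisfies the constraint, which happens when $m_3$ is small relative to $\alpha$ and $\delta_3$, the sketch returns `None` rather than a threshold.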


Proposition C.1 implies that $h(\delta_3, m_3, k) \le \alpha$ is a sufficient condition for the classifier
$\hat\phi_k$ (defined in (2.2)) to satisfy NP Oracle Inequality (I) ($k = 1, \ldots, m_3$). Let $\mathcal{K}_{\mathrm{chern}} =
\{k \in \{1, \cdots, m_3\} : h(\delta_3, m_3, k) \le \alpha\}$. Similar to Proposition 2.3, we can prove $\mathcal{K}_{\mathrm{chern}}$ to be
non-empty as long as $m_3$ is greater than some threshold.
Numerical investigation shows that for most combinations of $(\alpha, \delta_3, m_3)$ with non-empty
$\mathcal{K}$ and $\mathcal{K}_{\mathrm{chern}}$, $k_{\min} = \min \mathcal{K}$ as defined in (2.7) is better than $k_{\mathrm{chern}} = \min \mathcal{K}_{\mathrm{chern}}$ in the
sense that $\hat\phi_{k_{\min}}$ has a lower type II error than $\hat\phi_{k_{\mathrm{chern}}}$ as a result of $k_{\min} < k_{\mathrm{chern}}$. Specifically,
for each $\delta_3 \in \{0.01 \cdot i\}_{i=1}^{10}$, the number of $\{k_{\mathrm{chern}} < k_{\min}\}$ out of 100 combinations of
$(\alpha, m_3) \in \{0.01 \cdot i\}_{i=1}^{10} \times \{100 \cdot i\}_{i=1}^{10}$ is reported as follows. Only when $\delta_3$ gets very close to
0 is $k_{\mathrm{chern}}$ preferred to $k_{\min}$.


$\delta_3$                                        0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.10
$\#\{k_{\mathrm{chern}} < k_{\min}\}$              83    70    49     4     0     0     0     0     0     0